# 011-02 - Python Basics - Solution Notebook

* Written by Alexandre Gazagnes
* Last update: 2024-02-01

# About

### Using Jupyter

You have 2 options: 
- Locally: 

    - **Install Anaconda https://www.anaconda.com/ or Jupyter https://jupyter.org/install on your machine**


- Online:

    - **Use Google Colab https://colab.research.google.com/** (you need to be signed in to your Google account)

### Material

All the material for this course can be found here:
- https://github.com/AlexandreGazagnes/CentraleSupElec-NLP-Public-Ressources

### Python / Jupyter ? 

A few questions: 
- Why Python ? 
- Python vs R ? 
- What is Data Analysis ? 
- What are we talking about ? 
- What is Jupyter ?

### Context

This notebook covers some core features of the Python programming language, such as strings, functions and DataFrames.

### Useful Resources about Google Colab


- On Youtube : 
    - https://www.youtube.com/watch?v=8KeJZBZGtYo
    - https://www.youtube.com/watch?v=JJYZ3OE_lGo
    - https://www.youtube.com/watch?v=tCVXoTV12dE

### Useful Resources about Anaconda and Jupyter


- On Youtube : 
    - https://www.youtube.com/watch?v=ovlID7gefzE
    - https://www.youtube.com/watch?v=IMrxB8Mq5KU
    - https://www.youtube.com/watch?v=Ou-7G9VQugg
    - https://www.youtube.com/watch?v=5pf0_bpNbkw

### Useful Resources about Git and GitHub


- On Youtube : 
    - https://www.youtube.com/watch?v=RGOj5yH7evk
    - https://www.youtube.com/watch?v=3RjQznt-8kE&list=PL4cUxeGkcC9goXbgTDQ0n_4TBzOO0ocPR


### Teacher 

- More info : 
    - https://www.linkedin.com/in/alexandregazagnes/
    - https://github.com/AlexandreGazagnes
    

## Preliminaries

### System

These commands print the working directory, move to the parent directory, and list the files:

Uncomment these lines if needed. 

In [1]:
# pwd

In [2]:
# cd ..

In [3]:
# ls

These commands will install the required packages:

**Please note that if you are using Google Colab, everything you need is already installed.**

In [6]:
# !pip install pandas matplotlib seaborn plotly scikit-learn

### Imports

Import string libraries (built-in):

In [7]:
import string

# import secrets

Import data libraries:

In [8]:
import pandas as pd  # DataFrame

# import numpy as np      # Matrix and advanced maths operations

Import Graphical libraries:

In [9]:
import matplotlib.pyplot as plt  # Visualisation
import seaborn as sns  # Visualisation

# import plotly.express as px   # Visualisation (not used here)

Import ML libraries:

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin  # Machine Learning

:warning: **These imports are required: this notebook cannot run without pandas, matplotlib, etc.**

##  Basics of Python

### Strings 

A simple String : 

In [11]:
text = "Hello World"

Output : 

In [12]:
text

'Hello World'

Print : 

In [13]:
print(text)

Hello World


Type of ```text``` : 

In [14]:
type(text)

str

Specific string methods : 

Lower : 

In [15]:
text.lower()

'hello world'

Upper : 

In [16]:
text.upper()

'HELLO WORLD'

Strip : 

In [17]:
text = "    Hello World     "
text

'    Hello World     '

In [18]:
text.strip()

'Hello World'
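
`strip()` has two one-sided siblings, and all three accept an optional set of characters to remove. A minimal sketch (the `decorated` sample string is made up for illustration):

```python
raw = "    Hello World     "

left = raw.lstrip()    # removes leading whitespace only
right = raw.rstrip()   # removes trailing whitespace only
both = raw.strip()     # removes whitespace on both sides

# strip() can also be given the characters to remove (here: dashes and spaces)
decorated = "-- Hello World --"
undecorated = decorated.strip("- ")

print(repr(left))
print(repr(right))
print(repr(both))
print(repr(undecorated))
```

Note that none of these methods touch whitespace inside the string.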

Is Alpha :

In [19]:
text.isalpha()

False

In [20]:
text = "hello"
text.isalpha()

True

Length : 

In [21]:
len(text)

5

Is 'o' in text ? 

In [22]:
"o" in text

True

Please use these links if needed:
* https://www.w3schools.com/python/python_strings.asp 
* https://www.geeksforgeeks.org/python-string/


Sort the text : 

In [23]:
sorted(text)

['e', 'h', 'l', 'l', 'o']

Use a basic for loop for filtering :

In [24]:
new_txt = ""
for i in text:
    if i != "o":
        new_txt = new_txt + i
        # new_txt+=i

In [25]:
new_txt

'hell'
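
The loop works, but strings are immutable, so each `new_txt + i` builds a brand new string. Two shorter alternatives that should give the same result, shown as a sketch (the variable names are only illustrative):

```python
text = "hello"

# Option 1: replacing "o" with an empty string removes every occurrence
without_o = text.replace("o", "")

# Option 2: keep the wanted characters and glue them back together with str.join
without_o_join = "".join(c for c in text if c != "o")

print(without_o)
print(without_o_join)
```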

Basic list comprehension : 

In [26]:
[i for i in text]

['h', 'e', 'l', 'l', 'o']

Use list comprehension for filtering : 

In [27]:
[i for i in text if "l" != i]

['h', 'e', 'o']

Convert ```text``` to a list : 

In [28]:
list(text)

['h', 'e', 'l', 'l', 'o']

Concatenate 2 strings : 

In [29]:
txt1 = "hello"
txt2 = "world"

txt = txt1 + txt2
txt

'helloworld'

Better : 

In [30]:
txt = f"{txt1} {txt2}"
print(txt)

hello world


Split : 

In [31]:
txt.split(" ")

['hello', 'world']
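
`split(" ")` and `split()` do not behave the same way when the text contains repeated spaces or line breaks. A small sketch with a made-up `messy` string:

```python
messy = "hello   world\n again"

# split(" ") cuts at every single space, so consecutive spaces produce empty strings
by_space = messy.split(" ")

# split() with no argument cuts on any run of whitespace (spaces, tabs, line breaks)
by_whitespace = messy.split()

print(by_space)
print(by_whitespace)
```

This is why code that uses `split(" ")` often filters out empty strings afterwards.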

Use ```\n``` to add line breaks : 

In [32]:
txt = "\n\n\n Hello World\n\n\n"
txt

'\n\n\n Hello World\n\n\n'

With print : 

In [33]:
print(txt)




 Hello World





Another option : 

In [34]:
text = """

Hello
World 
!


"""

text

'\n\nHello\nWorld \n!\n\n\n'

With print : 

In [35]:
print(text)



Hello
World 
!





Indexing :

In [36]:
text = "hello world"
text[0]

'h'

In [37]:
text[2]

'l'

In [38]:
text[-1]

'd'

Slicing :

In [39]:
text[0:3]

'hel'

In [40]:
text[:3]

'hel'

In [41]:
text[2:4]

'll'

In [42]:
text[-2:]

'ld'
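
A slice can also take a third value, the step, written `text[start:stop:step]`. A short sketch of a few common patterns:

```python
text = "hello world"

every_other = text[::2]      # from start to end, taking one character out of two
reversed_text = text[::-1]   # a negative step walks the string backwards
first_word = text[:5]        # the stop index is excluded
last_word = text[-5:]        # negative indices count from the end

# Slice bounds that go past the end are clipped instead of raising an IndexError
clipped = text[:100]

print(every_other, reversed_text, first_word, last_word, clipped, sep="\n")
```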

Useful built-in library : 

In [43]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [44]:
string.ascii_uppercase

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [45]:
punct = string.punctuation
punct

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
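
One common use of `string.punctuation` is stripping punctuation from a text. A minimal sketch using `str.maketrans` and `str.translate` (the `sentence` value is made up for illustration):

```python
import string

sentence = "Hello, World! How are you?"

# Build a translation table that deletes every punctuation character
table = str.maketrans("", "", string.punctuation)

no_punct = sentence.translate(table)
print(no_punct)
```

A loop calling `replace` once per punctuation character gives the same result; `translate` does it in a single pass.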

### Function

Let's create a simple function to clean our ```text``` variable : 

In [46]:
text = """

Hello


World 


! 
"""

In [47]:
def clean(txt):
    """A very simple function"""

    txt = txt.lower()
    txt = txt.split()
    txt = [i.strip() for i in txt if i]

    return txt

In [48]:
cleaned_text = clean(text)
cleaned_text

['hello', 'world', '!']

We can add some optional arguments : 

In [49]:
def clean(txt, lower=True):
    """Not so simple function"""

    if lower:
        txt = txt.lower()

    txt = txt.replace("\n", " ")

    txt = txt.split(" ")

    txt = [i.strip() for i in txt if i]

    return txt

In [50]:
clean(text)

['hello', 'world', '!']
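
To see a default argument being overridden, here is a minimal sketch. `to_words` is a hypothetical, reduced stand-in used only for this illustration:

```python
def to_words(txt, lower=True):
    """Split txt on whitespace, lower-casing it first unless lower=False."""
    if lower:
        txt = txt.lower()
    return txt.split()


default_call = to_words("Hello World")               # lower keeps its default (True)
keyword_call = to_words("Hello World", lower=False)  # default overridden by keyword

print(default_call)
print(keyword_call)
```

Passing optional arguments by keyword (`lower=False`) rather than by position keeps calls readable as the number of options grows.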

And build a much better function : 

In [51]:
def clean(
    txt,
    return_type: str,
    lower: bool = True,
    remove_punct: bool = True,
    remove_small_words: bool = True,
    small_word_n_char: int = 3,
):
    """More complex function"""

    # check if return_type is OK
    if return_type not in ["str", "list"]:
        raise ValueError(
            f"return_type is not valid: received {return_type}, expected one of ['str', 'list']"
        )

    # if lower, apply the lower method
    if lower:
        txt = txt.lower()

    # remove breaklines
    txt = txt.replace("\n", " ")

    # if remove_punct, remove punctuation
    if remove_punct:
        for c in string.punctuation:
            txt = txt.replace(c, "")

    # split
    txt = txt.split(" ")

    # strip
    txt = [i.strip() for i in txt if i]

    # remove_small_words if needed
    if remove_small_words:
        txt = [i for i in txt if len(i) > small_word_n_char]

    # manage the return type
    if return_type == "list":
        return txt
    elif return_type == "str":
        return " ".join(txt)
    else:
        return -1

In [52]:
clean(text, return_type="list")

['hello', 'world']

In [53]:
clean(text, return_type="str")

'hello world'
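
The small-word filter uses a strict `>` comparison, so with the default threshold of 3 a three-letter word is dropped. A sketch isolating that step (`drop_small_words` is a hypothetical helper, not part of the notebook):

```python
def drop_small_words(words, small_word_n_char=3):
    """Keep only the words strictly longer than small_word_n_char."""
    return [w for w in words if len(w) > small_word_n_char]


words = ["my", "cat", "is", "blue"]

default_filter = drop_small_words(words)                      # threshold 3
looser_filter = drop_small_words(words, small_word_n_char=2)  # threshold 2

print(default_filter)
print(looser_filter)
```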

### Transformers

You are not expected to be familiar with custom transformers yet, but take 5 minutes to read this piece of code : 

In [54]:
class StringCleaner(BaseEstimator, TransformerMixin):

    def __init__(
        self,
        return_type: str,
        lower: bool = True,
        remove_punct: bool = True,
        remove_small_words: bool = True,
        small_word_n_char: int = 3,
    ):

        # check if return_type is OK
        if return_type not in ["str", "list"]:
            raise ValueError(
                f"return_type is not valid: received {return_type}, expected one of ['str', 'list']"
            )

        self.return_type = return_type
        self.lower = lower
        self.remove_punct = remove_punct
        self.remove_small_words = remove_small_words
        self.small_word_n_char = small_word_n_char

    def fit(self, txt, y=None):
        return self

    def transform(self, txt, y=None):

        # if lower, apply the lower method
        if self.lower:
            txt = txt.lower()

        # if remove_punct, remove punctuation
        if self.remove_punct:
            for c in string.punctuation:
                txt = txt.replace(c, "")

        # remove break lines
        txt = txt.replace("\n", " ")

        # split
        txt = txt.split(" ")

        # strip
        txt = [i.strip() for i in txt if i]

        # remove_small_words if needed
        if self.remove_small_words:
            txt = [i for i in txt if len(i) > self.small_word_n_char]

        # manage the return type
        if self.return_type == "list":
            return txt
        elif self.return_type == "str":
            return " ".join(txt)
        else:
            return -1

Let's create a ```text``` variable : 

In [55]:
text = "\n\nHello my FRIeND !!! "

And let's use our custom transformer : 

In [56]:
transformer = StringCleaner(return_type="list")
transformer.fit(text)
new_text = transformer.transform(text)
new_text

['hello', 'friend']
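
What do the two parent classes bring? As a rough sketch: `BaseEstimator` provides `get_params()` / `set_params()` based on the `__init__` arguments, and `TransformerMixin` provides `fit_transform()`. The `LowerCaser` class below is a hypothetical minimal transformer and, unlike the class above, it assumes a list of strings as input:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class LowerCaser(BaseEstimator, TransformerMixin):
    """Hypothetical minimal transformer: lower-cases each string of a list."""

    def __init__(self, strip: bool = True):
        self.strip = strip

    def fit(self, X, y=None):
        # Nothing to learn from the data, so fit only returns the object itself
        return self

    def transform(self, X, y=None):
        out = [s.lower() for s in X]
        if self.strip:
            out = [s.strip() for s in out]
        return out


lc = LowerCaser()
params = lc.get_params()                          # inherited from BaseEstimator
result = lc.fit_transform([" Hello ", "WORLD"])   # inherited from TransformerMixin

print(params)
print(result)
```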

### DataFrame

In order to create a dataframe, we can create a list of dictionaries : 

In [57]:
data = [
    {"_id": 0, "text": "My cat is Blue"},
    {"_id": 1, "text": "My cat is Red"},
    {"_id": 2, "text": "My cat is Dark. A very intense and beautiful dark "},
]

data

[{'_id': 0, 'text': 'My cat is Blue'},
 {'_id': 1, 'text': 'My cat is Red'},
 {'_id': 2, 'text': 'My cat is Dark. A very intense and beautiful dark '}]

Then we can turn this list of dictionaries into a dataframe, where each key becomes a column : 

In [58]:
df = pd.DataFrame(data)

df

|   | _id | text |
|---|---|---|
| 0 | 0 | My cat is Blue |
| 1 | 1 | My cat is Red |
| 2 | 2 | My cat is Dark. A very intense and beautiful d... |
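
A list of dictionaries (one per row) is not the only accepted input. As a sketch, the same table can be built from a dictionary of lists (one per column); the two-row sample below is shortened for illustration:

```python
import pandas as pd

# Row-oriented input: one dictionary per row
records = [
    {"_id": 0, "text": "My cat is Blue"},
    {"_id": 1, "text": "My cat is Red"},
]

# Column-oriented input: one list per column
columns = {
    "_id": [0, 1],
    "text": ["My cat is Blue", "My cat is Red"],
}

df_from_records = pd.DataFrame(records)
df_from_columns = pd.DataFrame(columns)

# Both constructions should produce the same dataframe
same = df_from_records.equals(df_from_columns)
print(same)
```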


Type of df : 

In [59]:
type(df)

pandas.core.frame.DataFrame

Let's create a new column : 

In [60]:
df["_len"] = df.text.apply(len)

df

|   | _id | text | _len |
|---|---|---|---|
| 0 | 0 | My cat is Blue | 14 |
| 1 | 1 | My cat is Red | 13 |
| 2 | 2 | My cat is Dark. A very intense and beautiful d... | 50 |
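
For text columns, pandas also offers the `.str` accessor, which applies a string method to every row without an explicit `apply`. A minimal sketch on a shortened sample (the `_n_words` and `_lower` column names are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"text": ["My cat is Blue", "My cat is Red"]})

df["_len"] = df["text"].str.len()                  # number of characters
df["_n_words"] = df["text"].str.split().str.len()  # number of whitespace-separated words
df["_lower"] = df["text"].str.lower()              # lower-cased copy

print(df)
```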


Selecting a column : 

In [61]:
df.text

0                                       My cat is Blue
1                                        My cat is Red
2    My cat is Dark. A very intense and beautiful d...
Name: text, dtype: object

or : 

In [62]:
df.loc[:, "text"]

0                                       My cat is Blue
1                                        My cat is Red
2    My cat is Dark. A very intense and beautiful d...
Name: text, dtype: object

or :

In [63]:
df.iloc[:, 1]

0                                       My cat is Blue
1                                        My cat is Red
2    My cat is Dark. A very intense and beautiful d...
Name: text, dtype: object

Selecting a row : 

In [64]:
df.iloc[0]

_id                  0
text    My cat is Blue
_len                14
Name: 0, dtype: object

or : 

In [65]:
df.loc[0, :]

_id                  0
text    My cat is Blue
_len                14
Name: 0, dtype: object
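
With the default index, `iloc[0]` and `loc[0]` return the same row, which hides the difference between them: `iloc` selects by position, `loc` selects by label. A sketch with a custom index (the labels 10, 20, 30 are chosen only to make the difference visible):

```python
import pandas as pd

df = pd.DataFrame(
    {"text": ["My cat is Blue", "My cat is Red", "My cat is Dark"]},
    index=[10, 20, 30],  # custom labels that differ from the positions 0, 1, 2
)

by_position = df.iloc[0]  # first row, whatever its label is
by_label = df.loc[10]     # row whose index label is 10

# No row is labelled 0 here, so df.loc[0] raises a KeyError
try:
    df.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(by_position["text"], by_label["text"], label_zero_exists)
```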

Our df : 

In [66]:
df

|   | _id | text | _len |
|---|---|---|---|
| 0 | 0 | My cat is Blue | 14 |
| 1 | 1 | My cat is Red | 13 |
| 2 | 2 | My cat is Dark. A very intense and beautiful d... | 50 |


Selecting specific values in a dataframe : 

In [67]:
df.loc[df._len > 15, :]

|   | _id | text | _len |
|---|---|---|---|
| 2 | 2 | My cat is Dark. A very intense and beautiful d... | 50 |
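
The expression inside `loc` is a boolean Series (a "mask") with one `True`/`False` value per row; only the `True` rows are kept. A sketch on a reduced dataframe, including how to combine two conditions:

```python
import pandas as pd

df = pd.DataFrame({"_id": [0, 1, 2], "_len": [14, 13, 50]})

# The comparison alone returns a boolean Series
mask = df["_len"] > 13
long_rows = df.loc[mask, :]

# Conditions are combined with & (and) or | (or); each one needs its own parentheses
both = df.loc[(df["_len"] > 13) & (df["_id"] < 2), :]

print(mask.tolist())
print(long_rows)
print(both)
```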


Describe numeric columns : 

In [68]:
df.describe(include="number").round(2)

|   | _id | _len |
|---|---|---|
| count | 3.0 | 3.0 |
| mean | 1.0 | 25.67 |
| std | 1.0 | 21.08 |
| min | 0.0 | 13.0 |
| 25% | 0.5 | 13.5 |
| 50% | 1.0 | 14.0 |
| 75% | 1.5 | 32.0 |
| max | 2.0 | 50.0 |


Describe non numeric columns : 

In [69]:
df.describe(exclude="number").round(2)

|   | text |
|---|---|
| count | 3 |
| unique | 3 |
| top | My cat is Blue |
| freq | 1 |
