# 011-02 - Python Basics - Exercise Notebook

* Written by Alexandre Gazagnes
* Last update: 2024-02-01

# About

### Using Jupyter

You have 2 options: 
- Locally: 

    - **Install Anaconda https://www.anaconda.com/ or Jupyter https://jupyter.org/install on your machine**


- Online:

    - **Use Google Colab https://colab.research.google.com/** (you must be signed in to your Google account)

### Material

All the material for this course can be found here:
- https://github.com/AlexandreGazagnes/CentraleSupElec-NLP-Public-Ressources

### Python / Jupyter ? 

A few questions : 
- Why Python ? 
- Python vs R ? 
- What is Data Analysis ? 
- What are we talking about ? 
- What is Jupyter ?

### Context

This notebook covers some core features of the Python programming language, such as strings, functions, and dataframes.

### Useful Resources about Google Colab


- On Youtube : 
    - https://www.youtube.com/watch?v=8KeJZBZGtYo
    - https://www.youtube.com/watch?v=JJYZ3OE_lGo
    - https://www.youtube.com/watch?v=tCVXoTV12dE

### Useful Resources about Anaconda and Jupyter


- On Youtube : 
    - https://www.youtube.com/watch?v=ovlID7gefzE
    - https://www.youtube.com/watch?v=IMrxB8Mq5KU
    - https://www.youtube.com/watch?v=Ou-7G9VQugg
    - https://www.youtube.com/watch?v=5pf0_bpNbkw

### Useful Resources about Git and GitHub


- On Youtube : 
    * https://www.youtube.com/watch?v=RGOj5yH7evk
    * https://www.youtube.com/watch?v=3RjQznt-8kE&list=PL4cUxeGkcC9goXbgTDQ0n_4TBzOO0ocPR


### Teacher 

- More info : 
    - https://www.linkedin.com/in/alexandregazagnes/
    - https://github.com/AlexandreGazagnes
    

## Preliminaries

### System

These commands display the current working directory and its contents, and move up one directory level:

Uncomment these lines if needed. 

In [1]:
# pwd

In [2]:
# cd ..

In [3]:
# ls

In [4]:
# cd ..

In [5]:
# ls

These commands will install the required packages:

**Please note that if you are using Google Colab, everything you need is already installed.**

In [6]:
# !pip install pandas matplotlib seaborn plotly scikit-learn

### Imports

Import string libraries (built-in) :

In [7]:
import string

# import secrets

Import data libraries:

In [8]:
import pandas as pd  # DataFrame

# import numpy as np      # Matrix and advanced maths operations

Import Graphical libraries:

In [9]:
import matplotlib.pyplot as plt  # Visualisation
import seaborn as sns  # Visualisation

# import plotly.express as px   # Visualisation (not used here)

Import ML libraries:

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin  # Machine Learning

:warning: **These imports are required: this notebook cannot be used without pandas, matplotlib, etc.**

##  Basics of Python

### Strings 

A simple String : 

In [11]:
text = "Hello World"

Output : 

In [12]:
text

'Hello World'

Print : 

Type of ```text``` : 

Specific string methods : 

Lower : 

Upper : 
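One possible way to answer the four prompts above (print, type, lower, upper). This is only a sketch; `text` is redefined so the cell runs on its own, and the variable names `text_type`, `lowered` and `uppered` are illustrative:

```python
text = "Hello World"

# print() displays the string without the surrounding quotes
print(text)

# type() returns the class of the object
text_type = type(text)

# lower() and upper() return new strings; `text` itself is not modified
lowered = text.lower()
uppered = text.upper()

print(text_type, lowered, uppered)
```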

Strip : 

In [13]:
text = "    Hello World     "
text

'    Hello World     '
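A minimal sketch of how the padded string above could be stripped. `strip()` removes whitespace on both sides, while `lstrip()` and `rstrip()` handle one side only; the variable names are illustrative:

```python
text = "    Hello World     "

stripped = text.strip()  # whitespace removed on both sides
left = text.lstrip()  # leading whitespace only
right = text.rstrip()  # trailing whitespace only

# repr() makes the remaining spaces visible
print(repr(stripped), repr(left), repr(right))
```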

Is Alpha :

In [14]:
text = "hello"

Length : 

'o' is in text ? 
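The three prompts above (is alpha, length, membership) could be answered along these lines. This is a sketch with illustrative variable names:

```python
text = "hello"

# isalpha() is True only if every character is a letter
only_letters = text.isalpha()
with_space = "hello world".isalpha()  # the space is not a letter

# len() counts the characters
n_chars = len(text)

# the `in` operator tests for a substring
has_o = "o" in text

print(only_letters, with_space, n_chars, has_o)
```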

Please use these links if needed:
* https://www.w3schools.com/python/python_strings.asp 
* https://www.geeksforgeeks.org/python-string/


Sort the text : 
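Strings have no `sort()` method, so one possible approach is the built-in `sorted()`, which returns a list of characters that can be joined back into a string. A minimal sketch:

```python
text = "hello"

# sorted() works on any iterable and always returns a list
sorted_chars = sorted(text)

# join the characters back into a single string
sorted_text = "".join(sorted_chars)

print(sorted_chars, sorted_text)
```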

Use a basic for loop for filtering :

In [15]:
new_txt = ""
for i in text:
    # Some code here ?
    pass

In [16]:
new_txt

''
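One way the loop above might be completed. The filter criterion (dropping vowels) is an assumption made for illustration; the exercise does not say what to filter:

```python
text = "hello"
vowels = "aeiouy"  # hypothetical criterion: remove these characters

new_txt = ""
for char in text:
    # keep the character only if it is not a vowel
    if char not in vowels:
        new_txt += char

print(new_txt)
```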

Basic list comprehension : 

Use list comprehension for filtering : 

Transform ```text``` into a list : 
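A sketch covering the three prompts above: a basic list comprehension, a comprehension with a filter, and a direct conversion to a list. The vowel filter is again an illustrative assumption:

```python
text = "hello"

# basic list comprehension: apply an expression to each character
upper_chars = [char.upper() for char in text]

# list comprehension with a condition (hypothetical filter: drop vowels)
no_vowels = [char for char in text if char not in "aeiouy"]

# list() splits a string into its characters
chars = list(text)

print(upper_chars, no_vowels, chars)
```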

Concatenate 2 strings : 

In [17]:
txt1 = "hello"
txt2 = "world"

Better : 

Split : 
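A possible answer to the concatenation and split prompts above. The "better" options shown here (f-string and `join`) are one interpretation of that prompt, not necessarily the one intended:

```python
txt1 = "hello"
txt2 = "world"

# simple concatenation with +
concatenated = txt1 + " " + txt2

# often considered more readable: an f-string or str.join
with_fstring = f"{txt1} {txt2}"
with_join = " ".join([txt1, txt2])

# split() is the inverse operation: string -> list of words
words = with_join.split(" ")

print(concatenated, with_fstring, with_join, words)
```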

Use ```\n``` to add line breaks : 

In [18]:
txt = "\n\n\n Hello World\n\n\n"
txt

'\n\n\n Hello World\n\n\n'

With print : 
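A short sketch of the difference between displaying the value (escaped form, as in the cell output above) and printing it (real line breaks):

```python
txt = "\n\n\n Hello World\n\n\n"

# repr() gives the escaped form that a notebook shows for a bare variable
print(repr(txt))

# print() interprets each \n as an actual line break
print(txt)

# count() confirms how many line breaks the string holds
n_breaks = txt.count("\n")
```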

Another option : 

In [19]:
text = """

Hello
World 
!


"""

text

'\n\nHello\nWorld \n!\n\n\n'

With print : 
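A triple-quoted literal is just another way to write the same kind of string: each line break typed in the source becomes a `\n`. A minimal sketch on a shortened example (not the exact `text` above):

```python
text = """
Hello
World
"""

# the triple-quoted literal and the escaped literal are equal
same = text == "\nHello\nWorld\n"

# print() renders the line breaks
print(text)

# splitlines() cuts the string at each line break
lines = text.splitlines()
```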

Indexing :

In [20]:
text = "hello world"
# 0

In [21]:
# 2

In [22]:
# -1

Slicing :

In [23]:
# 0:3

In [24]:
# :3

In [25]:
# 2:4

In [26]:
# -2
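The indexing and slicing hints above could translate to the expressions below. The last hint (`-2` under "Slicing") is read here as `text[-2:]`, which is an assumption:

```python
text = "hello world"

# indexing: one character, counting from 0 (negative indexes count from the end)
first = text[0]
third = text[2]
last = text[-1]

# slicing: start is included, stop is excluded
first_three = text[0:3]
also_first_three = text[:3]  # an omitted start means 0
middle = text[2:4]
last_two = text[-2:]  # assumption: the "-2" hint means "the last two characters"

print(first, third, last, first_three, middle, last_two)
```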

Useful built-in library : 
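The library in question is presumably the `string` module imported earlier. A quick look at a few of its constants:

```python
import string

# ready-made character sets
letters = string.ascii_lowercase
digits = string.digits
punct = string.punctuation

print(letters)
print(digits)
print(punct)

# typical use: test whether a character belongs to a set
is_punct = "!" in string.punctuation
```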

### Function

Let's create a simple function to clean our ```text``` variable : 

In [27]:
text = """

Hello


World 


! 
"""

In [28]:
def clean(txt):
    """A very simple function"""

    txt = txt.lower()
    txt = txt.split()
    txt = [i.strip() for i in txt if i]

    return txt
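A usage sketch for this first version. The function is repeated so the cell runs on its own, and the sample text is a simplified variant of the one above:

```python
def clean(txt):
    """Same simple function as above, repeated so this cell is standalone."""
    txt = txt.lower()
    txt = txt.split()
    return [i.strip() for i in txt if i]


text = "\n\nHello\n\n\nWorld\n\n\n!\n"

# split() with no argument cuts on any run of whitespace, line breaks included
tokens = clean(text)
print(tokens)
```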

We can add some optional arguments : 

In [29]:
def clean(txt, lower=True):
    """No so simple function"""

    if lower:
        txt = txt.lower()

    txt = txt.replace("\n", " ")

    txt = txt.split(" ")

    txt = [i.strip() for i in txt if i]

    return txt
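This version splits on a single space, which is why the `if i` filter matters: consecutive spaces produce empty strings. A small sketch on a made-up sample:

```python
sample = "hello  world\n!"

# split() with no argument: cuts on any whitespace, never yields empty strings
default_split = sample.split()

# split(" "): cuts on each single space, so a double space yields ""
space_split = sample.replace("\n", " ").split(" ")

# the `if i` condition drops the empty strings
filtered = [i for i in space_split if i]

print(default_split, space_split, filtered)
```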

And build a much better function : 

In [30]:
def clean(
    txt,
    return_type: str,
    lower: bool = True,
    remove_punct: bool = True,
    remove_small_words: bool = True,
    small_word_n_char: int = 3,
):
    """More complex function"""

    # check if return_type is OK
    if return_type not in ["str", "list"]:
        raise ValueError(
            f"return_type is not valid: received {return_type!r}, expected one of ['str', 'list']"
        )

    # if lower, apply the lower method
    if lower:
        txt = txt.lower()

    # remove breaklines
    txt = txt.replace("\n", " ")

    # if remove_punct, remove punctuation
    if remove_punct:
        for c in string.punctuation:
            txt = txt.replace(c, "")

    # split
    txt = txt.split(" ")

    # strip
    txt = [i.strip() for i in txt if i]

    # remove_small_words if needed
    if remove_small_words:
        txt = [i for i in txt if len(i) > small_word_n_char]

    # manage the return type
    if return_type == "list":
        return txt
    elif return_type == "str":
        return " ".join(txt)
    else:
        return -1
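The punctuation step above loops over `string.punctuation` and calls `replace` once per character. As an alternative that is not part of the original notebook, `str.translate` can do the same in one pass. A sketch comparing both on a made-up sample:

```python
import string

txt = "Hello, World !!!"

# loop version, as in the function above
looped = txt
for c in string.punctuation:
    looped = looped.replace(c, "")

# alternative (not in the original notebook): a translation table that
# deletes every punctuation character in a single pass
table = str.maketrans("", "", string.punctuation)
translated = txt.translate(table)

print(repr(looped), repr(translated))
```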

### Transformers

You are not supposed to be familiar with custom transformers, but take 5 minutes to read this piece of code : 

In [31]:
class StringCleaner(BaseEstimator, TransformerMixin):

    def __init__(
        self,
        return_type: str,
        lower: bool = True,
        remove_punct: bool = True,
        remove_small_words: bool = True,
        small_word_n_char: int = 3,
    ):

        # check if return_type is OK
        if return_type not in ["str", "list"]:
            raise ValueError(
                f"return_type is not valid: received {return_type!r}, expected one of ['str', 'list']"
            )

        self.return_type = return_type
        self.lower = lower
        self.remove_punct = remove_punct
        self.remove_small_words = remove_small_words
        self.small_word_n_char = small_word_n_char

    def fit(self, txt, y=None):
        return self

    def transform(self, txt, y=None):

        # if lower, apply the lower method
        if self.lower:
            txt = txt.lower()

        # if remove_punct, remove punctuation
        if self.remove_punct:
            for c in string.punctuation:
                txt = txt.replace(c, "")

        # remove break lines
        txt = txt.replace("\n", " ")

        # split
        txt = txt.split(" ")

        # strip
        txt = [i.strip() for i in txt if i]

        # remove_small_words if needed
        if self.remove_small_words:
            txt = [i for i in txt if len(i) > self.small_word_n_char]

        # manage the return type
        if self.return_type == "list":
            return txt
        elif self.return_type == "str":
            return " ".join(txt)
        else:
            return -1

Let's create a ```text``` variable : 

In [32]:
text = "\n\nHello my FRIeND !!! "

And let's use our custom transformer : 
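A possible usage sketch. The class below is a condensed copy of the transformer above (validation omitted) so that the cell runs on its own; it assumes scikit-learn is installed:

```python
import string

from sklearn.base import BaseEstimator, TransformerMixin


class StringCleaner(BaseEstimator, TransformerMixin):
    """Condensed copy of the transformer above, for a standalone example."""

    def __init__(self, return_type, lower=True, remove_punct=True,
                 remove_small_words=True, small_word_n_char=3):
        self.return_type = return_type
        self.lower = lower
        self.remove_punct = remove_punct
        self.remove_small_words = remove_small_words
        self.small_word_n_char = small_word_n_char

    def fit(self, txt, y=None):
        # nothing to learn: the transformer is stateless
        return self

    def transform(self, txt, y=None):
        if self.lower:
            txt = txt.lower()
        if self.remove_punct:
            for c in string.punctuation:
                txt = txt.replace(c, "")
        tokens = [i for i in txt.replace("\n", " ").split(" ") if i]
        if self.remove_small_words:
            tokens = [i for i in tokens if len(i) > self.small_word_n_char]
        return tokens if self.return_type == "list" else " ".join(tokens)


text = "\n\nHello my FRIeND !!! "

# same fit / transform interface as any scikit-learn transformer
cleaner = StringCleaner(return_type="str")
cleaned = cleaner.fit(text).transform(text)

# keep the short words and return a list instead
as_list = StringCleaner(return_type="list", remove_small_words=False).transform(text)

print(repr(cleaned), as_list)
```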

### DataFrame

In order to create a dataframe, we can create a list of dictionaries : 

In [33]:
data = [
    {"_id": 0, "text": "My cat is Blue"},
    {"_id": 1, "text": "My cat is Red"},
    {"_id": 2, "text": "My cat is Dark. A very intense and beautiful dark "},
]

data

[{'_id': 0, 'text': 'My cat is Blue'},
 {'_id': 1, 'text': 'My cat is Red'},
 {'_id': 2, 'text': 'My cat is Dark. A very intense and beautiful dark '}]

Then we can turn our list of dictionaries into a dataframe : 

Type of df : 

Let's create a new column : 
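One way to answer the three prompts above (build the dataframe, check its type, add a column). The column name `n_char` and its content (text length) are assumptions chosen for illustration:

```python
import pandas as pd

data = [
    {"_id": 0, "text": "My cat is Blue"},
    {"_id": 1, "text": "My cat is Red"},
    {"_id": 2, "text": "My cat is Dark. A very intense and beautiful dark "},
]

# each dictionary becomes a row, each key becomes a column
df = pd.DataFrame(data)

# type of df
df_type = type(df)

# new column (hypothetical): number of characters in each text
df["n_char"] = df["text"].str.len()

print(df_type)
print(df)
```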

Selecting a column : 

In [34]:
df.text

NameError: name 'df' is not defined

or : 

or :

Selecting a row : 

or : 
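The alternatives hinted at above for selecting a column or a row could look like this. This is a sketch; the dataframe is rebuilt so the cell runs on its own:

```python
import pandas as pd

data = [
    {"_id": 0, "text": "My cat is Blue"},
    {"_id": 1, "text": "My cat is Red"},
    {"_id": 2, "text": "My cat is Dark. A very intense and beautiful dark "},
]
df = pd.DataFrame(data)

# selecting a column: three equivalent spellings
col_attr = df.text  # attribute access, only for simple column names
col_key = df["text"]  # key access, works for any column name
col_loc = df.loc[:, "text"]  # label-based: all rows, column "text"

# selecting a row: by label or by position
row_label = df.loc[0]  # row whose index label is 0
row_pos = df.iloc[0]  # first row, whatever its label

print(col_key)
print(row_label)
```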

Our df : 

In [None]:
df

Selecting specific values in a dataframe : 
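A sketch of a few common ways to pick specific values: a single cell with `loc` or `at`, and several rows with a boolean mask. The condition `_id > 0` is only an illustrative choice:

```python
import pandas as pd

data = [
    {"_id": 0, "text": "My cat is Blue"},
    {"_id": 1, "text": "My cat is Red"},
    {"_id": 2, "text": "My cat is Dark. A very intense and beautiful dark "},
]
df = pd.DataFrame(data)

# one cell: row label 1, column "text"
value = df.loc[1, "text"]
same_value = df.at[1, "text"]  # at is a scalar-only accessor

# several rows: boolean mask (hypothetical condition)
mask = df["_id"] > 0
subset = df.loc[mask, "text"]

print(value)
print(subset)
```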

Describe numeric columns : 

Describe non numeric columns : 
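Both describe prompts above can be handled by `DataFrame.describe`. By default it summarises numeric columns; excluding numeric dtypes is one way (among others) to summarise the remaining ones. A minimal sketch:

```python
import pandas as pd

data = [
    {"_id": 0, "text": "My cat is Blue"},
    {"_id": 1, "text": "My cat is Red"},
    {"_id": 2, "text": "My cat is Dark. A very intense and beautiful dark "},
]
df = pd.DataFrame(data)

# numeric columns: count, mean, std, min, quartiles, max
numeric_summary = df.describe()

# non-numeric columns: count, unique, top, freq
text_summary = df.describe(exclude=["number"])

print(numeric_summary)
print(text_summary)
```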
