# Pandas

Exercise 1.1 :
- Using the 4 datasets about wine quality and characteristics, rebuild a single dataset
- Display the variable types and convert them if needed. Also display the dataset characteristics
- Define an index
- Check whether there are missing values and choose a strategy to deal with them
- Remove the quality ranks that have fewer than 50 occurrences
- Order the rows by quality rank
- Give some summary statistics for each column
- Display a correlation matrix in one graph
- Create a boxplot for each variable

Put all of these steps in a function that allows filtering on pH
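
The steps above could be combined along these lines. This is only a sketch: it assumes the four files stack row-wise with `pd.concat`, that the relevant columns are named `quality` and `pH`, and that dropping incomplete rows is an acceptable missing-value strategy. The function name `build_wine_table` and the index name `wine_id` are hypothetical.

```python
import pandas as pd


def build_wine_table(frames, ph_min=None, ph_max=None, min_count=50):
    """Concatenate the input frames and apply the cleaning steps of Exercise 1.1.

    The column names "quality" and "pH" are assumptions about the wine datasets.
    """
    df = pd.concat(frames, ignore_index=True)
    df.index.name = "wine_id"  # hypothetical index name

    # One possible missing-value strategy: drop incomplete rows.
    df = df.dropna()

    # Optional pH filter.
    if ph_min is not None:
        df = df[df["pH"] >= ph_min]
    if ph_max is not None:
        df = df[df["pH"] <= ph_max]

    # Keep only the quality ranks with at least `min_count` occurrences.
    counts = df["quality"].value_counts()
    frequent = counts[counts >= min_count].index
    df = df[df["quality"].isin(frequent)]

    # Order the rows by quality rank.
    return df.sort_values("quality")
```

The statistics, correlation matrix and boxplots can then be produced from the returned frame, for instance with `df.describe()`, `df.corr()` and `df.boxplot()`.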

# Numpy

Exercise 2.1 :
Create a function that computes, for each variable (using NumPy and the output table of Exercise 1.1):
- Min and Max
- D1 and D9
- Q1 and Q3
- Mean and Max
- Count the number of outliers according to the Tukey rule
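
A possible starting point for these statistics is sketched below. It assumes D1/D9 mean the 10th and 90th percentiles, that NumPy's default linear interpolation is acceptable for quantiles, and that the Tukey rule uses the usual 1.5 × IQR fences. The function names are hypothetical.

```python
import numpy as np


def describe_column(values):
    """Return summary statistics for one numeric variable."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing values

    d1, q1, q3, d9 = np.percentile(x, [10, 25, 75, 90])
    iqr = q3 - q1
    # Tukey rule: a point is an outlier if it lies more than 1.5 * IQR
    # below Q1 or above Q3.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(np.sum((x < low) | (x > high)))

    return {
        "min": x.min(), "max": x.max(),
        "d1": d1, "d9": d9,
        "q1": q1, "q3": q3,
        "mean": x.mean(),
        "n_outliers": n_outliers,
    }


def describe_table(table, names):
    """Apply describe_column to each column of a 2-D numeric array."""
    table = np.asarray(table, dtype=float)
    return {name: describe_column(table[:, j]) for j, name in enumerate(names)}
```

With a pandas frame `df` from Exercise 1.1, a call such as `describe_table(df.to_numpy(), df.columns)` would work as long as every column is numeric.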

Exercise 2.2 :
https://towardsdatascience.com/principal-component-analysis-pca-from-scratch-in-python-7f3e2a540c51
- Check the data distribution with a graph
- Scale the data with the appropriate method
- Perform a PCA from scratch (NumPy) and choose the right number of components
- Plot the eigenvalues in a graph
- Plot the explained variance in a graph
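
One way to approach the from-scratch PCA is through the eigendecomposition of the covariance matrix. The sketch below assumes standardization (zero mean, unit variance) is the appropriate scaling, which depends on what the distribution check shows, and it uses a cumulative explained-variance threshold as one possible rule for choosing the number of components. Both function names are hypothetical.

```python
import numpy as np


def pca_from_scratch(X):
    """PCA via eigendecomposition of the covariance matrix of standardized data."""
    X = np.asarray(X, dtype=float)
    # Standardize each column (assumed scaling method).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    cov = np.cov(Z, rowvar=False)
    # eigh is meant for symmetric matrices and returns ascending eigenvalues,
    # so the results are reordered from largest to smallest.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    explained = eigvals / eigvals.sum()  # explained variance ratio
    scores = Z @ eigvecs                 # data projected on the components
    return eigvals, eigvecs, explained, scores


def n_components_for(explained, threshold=0.9):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    k = int(np.searchsorted(np.cumsum(explained), threshold) + 1)
    return min(k, len(explained))
```

The returned `eigvals` and `explained` arrays are what the two requested graphs would display, for example as a scree plot and a cumulative curve.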


# BeautifulSoup

Starting from the following webpage, extract the weather data from 01/01/2022 to now:
(BONUS: automatic column extraction)
https://www.historique-meteo.net/france/provence-alpes-c-te-d-azur/marseille/2022/09/01/

In [22]:
import pandas as pd
from datetime import date, timedelta
sdate = date(2019,3,22)   # start date
edate = date(2019,4,9) 
a = list(pd.date_range(sdate,edate-timedelta(days=1),freq='d'))
str(a[0].date()).replace("-", "/")

'2019/03/22'

In [23]:
a

[Timestamp('2019-03-22 00:00:00', freq='D'),
 Timestamp('2019-03-23 00:00:00', freq='D'),
 Timestamp('2019-03-24 00:00:00', freq='D'),
 Timestamp('2019-03-25 00:00:00', freq='D'),
 Timestamp('2019-03-26 00:00:00', freq='D'),
 Timestamp('2019-03-27 00:00:00', freq='D'),
 Timestamp('2019-03-28 00:00:00', freq='D'),
 Timestamp('2019-03-29 00:00:00', freq='D'),
 Timestamp('2019-03-30 00:00:00', freq='D'),
 Timestamp('2019-03-31 00:00:00', freq='D'),
 Timestamp('2019-04-01 00:00:00', freq='D'),
 Timestamp('2019-04-02 00:00:00', freq='D'),
 Timestamp('2019-04-03 00:00:00', freq='D'),
 Timestamp('2019-04-04 00:00:00', freq='D'),
 Timestamp('2019-04-05 00:00:00', freq='D'),
 Timestamp('2019-04-06 00:00:00', freq='D'),
 Timestamp('2019-04-07 00:00:00', freq='D'),
 Timestamp('2019-04-08 00:00:00', freq='D')]
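
Building on the date list above, the pages to scrape could be enumerated as follows. This sketch assumes every day follows the same `/YYYY/MM/DD/` pattern as the example URL; it only builds the addresses and does not fetch or parse anything. The names `BASE_URL` and `daily_urls` are hypothetical.

```python
from datetime import date, timedelta

# Prefix taken from the example URL of the exercise.
BASE_URL = "https://www.historique-meteo.net/france/provence-alpes-c-te-d-azur/marseille"


def daily_urls(start, end):
    """Yield one URL per day from start to end (inclusive)."""
    day = start
    while day <= end:
        # Assumed pattern: <base>/YYYY/MM/DD/
        yield f"{BASE_URL}/{day:%Y/%m/%d}/"
        day += timedelta(days=1)


urls = list(daily_urls(date(2022, 1, 1), date(2022, 1, 3)))
for url in urls:
    print(url)
```

Passing `date.today()` as `end` covers the "to now" part of the exercise; each URL can then be downloaded and handed to BeautifulSoup.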

# Sklearn

Exercise 4.1 :

Take the ML example from the course and try to improve the model accuracy using Sklearn:
  - Add preprocessing steps
  - Add columns
  - Test other models
  - Test other fine-tuning methods

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.utils import class_weight
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Import dataset
train_df = pd.read_csv("data/train.csv")
train_df = train_df.set_index("PassengerId")
test_df = pd.read_csv("data/test.csv")
test_df = test_df.set_index("PassengerId")

# Train test split
X_train, X_val, y_train, y_val = train_test_split(train_df.drop("Survived", axis=1),
                                              train_df["Survived"], test_size=0.33, random_state=42)

##### Train set preparation #####
# Select column
X_train = X_train[["Pclass", "Sex","Age", "SibSp", "Parch", "Fare", "Embarked"]]

# Preprocess - Categorical
encoder = OneHotEncoder()
encoder.fit(X_train[["Pclass", "Sex", "Embarked"]])
X_train_enc = encoder.transform(X_train[["Pclass", "Sex", "Embarked"]])
# get_feature_names was deprecated and later removed; get_feature_names_out replaces it
ohe_df = pd.DataFrame(data = X_train_enc.toarray(),
             columns=encoder.get_feature_names_out(["Pclass", "Sex", "Embarked"]),
             index = X_train.index)

# Preprocess - Numerical
scaler = MinMaxScaler()
scaler.fit(X_train[["Age", "SibSp", "Parch", "Fare"]])
X_train_scale = scaler.transform(X_train[["Age", "SibSp", "Parch", "Fare"]])
scaled_df = pd.DataFrame(data = X_train_scale, 
                         columns=["Age", "SibSp", "Parch", "Fare"],
                         index = X_train.index)

# Concatenation
train_df_prep = pd.concat([scaled_df, ohe_df], axis=1)
train_df_prep = train_df_prep.drop("Embarked_nan", axis=1)


##### Validation set preparation #####
# Select columns
X_val = X_val[["Pclass", "Sex","Age", "SibSp", "Parch", "Fare", "Embarked"]]

# Apply encoder and scaler
X_val_enc = encoder.transform(X_val[["Pclass", "Sex", "Embarked"]])
val_ohe_df = pd.DataFrame(data = X_val_enc.toarray(),
             columns=encoder.get_feature_names_out(["Pclass", "Sex", "Embarked"]),
             index = X_val.index)

X_val_scale = scaler.transform(X_val[["Age", "SibSp", "Parch", "Fare"]])
val_scaled_df = pd.DataFrame(data = X_val_scale, 
                         columns=["Age", "SibSp", "Parch", "Fare"],
                         index = X_val.index)

# Concatenation
val_df_prep = pd.concat([val_scaled_df, val_ohe_df], axis=1)
val_df_prep = val_df_prep.drop("Embarked_nan", axis=1)


##### test set preparation #####
# Select columns
test_df = test_df[["Pclass", "Sex","Age", "SibSp", "Parch", "Fare", "Embarked"]]

# Apply encoder and scaler
test_df_enc = encoder.transform(test_df[["Pclass", "Sex", "Embarked"]])
test_ohe_df = pd.DataFrame(data = test_df_enc.toarray(),
             columns=encoder.get_feature_names_out(["Pclass", "Sex", "Embarked"]),
             index = test_df.index)

test_df_scale = scaler.transform(test_df[["Age", "SibSp", "Parch", "Fare"]])
test_scaled_df = pd.DataFrame(data = test_df_scale, 
                         columns=["Age", "SibSp", "Parch", "Fare"],
                         index = test_df.index)

# Concatenation
test_df_prep = pd.concat([test_scaled_df, test_ohe_df], axis=1)
test_df_prep = test_df_prep.drop("Embarked_nan", axis=1)


##### Deal with missing values #####
# train set
train_df_prep = train_df_prep.dropna()
y_train = y_train.filter(items = list(train_df_prep.index))

# validation set
val_df_prep = val_df_prep.dropna()
y_val = y_val.filter(items = list(val_df_prep.index))

# test set
test_df_prep = test_df_prep.dropna()

##### Test several models #####
accuracy = []
models = [
          ('RF', RandomForestClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('XGB', XGBClassifier())
        ]
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc']

for name, model in models:
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)
    cv_results = model_selection.cross_validate(model, train_df_prep, y_train, cv=kfold, scoring=scoring)
    clf = model.fit(train_df_prep, y_train)
    y_pred = clf.predict(val_df_prep)
    print(name)
    print(classification_report(y_val, y_pred))
    
    
##### Grid search #####
param_grid = { 
    'n_estimators': [200, 500],
    # 'auto' is deprecated for max_features; for classifiers it was equivalent to 'sqrt'
    'max_features': ['sqrt', 'log2'],
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

rfc=RandomForestClassifier(random_state=42)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
CV_rfc.fit(train_df_prep, y_train)
# Predict on the preprocessed validation set: the model was fitted on the encoded
# and scaled columns, so the raw X_val (with strings such as 'male') cannot be used
y_pred_gscv = CV_rfc.predict(val_df_prep)
print(CV_rfc.best_params_)
print(classification_report(y_val, y_pred_gscv))



RF
              precision    recall  f1-score   support

           0       0.79      0.80      0.80       138
           1       0.72      0.69      0.70        98

    accuracy                           0.76       236
   macro avg       0.75      0.75      0.75       236
weighted avg       0.76      0.76      0.76       236

KNN
              precision    recall  f1-score   support

           0       0.77      0.85      0.81       138
           1       0.75      0.65      0.70        98

    accuracy                           0.77       236
   macro avg       0.76      0.75      0.75       236
weighted avg       0.77      0.77      0.76       236

XGB
              precision    recall  f1-score   support

           0       0.82      0.80      0.81       138
           1       0.73      0.74      0.74        98

    accuracy                           0.78       236
   macro avg       0.77      0.77      0.77       236
weighted avg       0.78      0.78      0.78       236
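
As one idea for the "add preprocessing steps" item, the manual encode/scale/concatenate sequence of the cell above can be wrapped in a single scikit-learn pipeline, so the same transformations are applied automatically at fit and predict time. The sketch below is not the course's method: it swaps row dropping for imputation (most frequent value for categorical columns, median for numerical ones), and the small `toy` frame is made-up data that only reuses the Titanic column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

categorical = ["Pclass", "Sex", "Embarked"]
numerical = ["Age", "SibSp", "Parch", "Fare"]

# Impute then encode/scale, column group by column group.
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")), categorical),
    (make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler()), numerical),
)
model = make_pipeline(preprocess, RandomForestClassifier(random_state=42))

# Made-up rows, only to show that raw columns can be passed directly.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, np.nan, 35.0, 4.0, 58.0, 30.0],
    "SibSp": [1, 0, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 2, 0, 0],
    "Fare": [7.25, 71.3, 8.05, 21.1, 26.6, 13.0],
    "Embarked": ["S", "C", np.nan, "S", "Q", "S"],
})
y = pd.Series([0, 1, 1, 0, 1, 0])

model.fit(toy, y)
pred = model.predict(toy)  # raw, unencoded columns are accepted here
```

A pipeline like this can also be passed as the `estimator` of `GridSearchCV`, with parameter names prefixed by the step name (for example `randomforestclassifier__max_depth`).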

Exercise 4.2

Using the eigenvectors from Exercise 2.2:
- Cluster the wines with Sklearn (use the documentation). We will use the DBSCAN method.
- Draw a spider/radar chart with 6 variables of your choice (one chart per cluster; you can use an input variable to select the cluster to be plotted)
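
The clustering step might look like the sketch below. The `scores` array is synthetic and merely stands in for the wine data projected on the principal components; the `eps` and `min_samples` values are illustrative and would need tuning on the real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the PCA scores of Exercise 2.2: two separated blobs.
scores = np.vstack([
    rng.normal(loc=0.0, scale=0.2, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.2, size=(50, 2)),
])

scaled = StandardScaler().fit_transform(scores)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(scaled)

# DBSCAN marks noise points with the label -1, which is not a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)

# Per-cluster means: the kind of values a radar chart would display.
for cluster in sorted(set(labels) - {-1}):
    print(cluster, scores[labels == cluster].mean(axis=0))
```

For the radar chart, the per-cluster means of the 6 chosen variables can be drawn on a polar axis, one chart per cluster label.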



# Keras / Tensorflow

Exercise 5.1 

Reproduce the methodology used for the MNIST dataset to get the best possible performance on the Fashion-MNIST dataset!
Let's start the competition!

In [2]:
import os
import gzip
import numpy as np
def load_mnist(path, kind='train'):

    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte.gz'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte.gz'
                               % kind)

    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels

In [3]:
X_train, y_train = load_mnist('data/fashion', kind='train')
X_test, y_test = load_mnist('data/fashion', kind='t10k')
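
Before feeding these arrays to a Keras model, they usually need rescaling and reshaping. The sketch below assumes a convolutional network expecting 28x28x1 inputs and one-hot labels over the 10 Fashion-MNIST classes; a dense network could keep the flat 784-pixel rows instead. The function name `prepare_for_cnn` is hypothetical.

```python
import numpy as np


def prepare_for_cnn(images, labels, num_classes=10):
    """Scale pixels to [0, 1], reshape to 28x28x1 and one-hot encode the labels."""
    x = images.astype("float32") / 255.0
    x = x.reshape(-1, 28, 28, 1)
    # Row i of the identity matrix is the one-hot vector of class i.
    y = np.eye(num_classes, dtype="float32")[labels]
    return x, y


# Small fake batch with the same layout as the output of load_mnist.
fake_images = (np.arange(2 * 784) % 256).astype(np.uint8).reshape(2, 784)
fake_labels = np.array([3, 9], dtype=np.uint8)
x, y = prepare_for_cnn(fake_images, fake_labels)
print(x.shape, y.shape)
```

With the real data, the call would be `prepare_for_cnn(X_train, y_train)` and likewise for the test set.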

# Regex

Exercise 6.1

Extract the following information : 
  - Plane model
  - The phone number
  - The Master track
  - The country based on the phone number
  - The name and surname
  - The Age

In [15]:
text_airbus = "At Airbus, we build A321, A400M and A380"

text_phone = "Hey I just met you, this is crazy, here is my number, so call me maybe : 06 10 30 74 21"

text_master = "At AMSE, you could join several Masters : M2 MBSE, M2 APE, M2 Finance"

text_country = "You can join me at at +33(0)761850594"

text_name = "Julie DUPONT and Marion MARTIN are my best friends !"

text_age = "Julie is 23 years old, and Marion is 25 years old"
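
A few candidate patterns for these strings are sketched below. They are tailored to these exact sentences and are not general-purpose validators; the mapping from calling code to country is a tiny hypothetical table containing only the entry needed here.

```python
import re

text_airbus = "At Airbus, we build A321, A400M and A380"
text_phone = "Hey I just met you, this is crazy, here is my number, so call me maybe : 06 10 30 74 21"
text_master = "At AMSE, you could join several Masters : M2 MBSE, M2 APE, M2 Finance"
text_country = "You can join me at at +33(0)761850594"
text_name = "Julie DUPONT and Marion MARTIN are my best friends !"
text_age = "Julie is 23 years old, and Marion is 25 years old"

# Plane models: "A", three digits, optional trailing "M".
planes = re.findall(r"\bA\d{3}M?\b", text_airbus)

# Phone number: five pairs of digits separated by spaces.
phone = re.search(r"(?:\d{2} ){4}\d{2}", text_phone).group()

# Master tracks: the word following "M2 ".
masters = re.findall(r"M2 (\w+)", text_master)

# Country: calling code after "+", looked up in a hypothetical table.
CALLING_CODES = {"33": "France"}
code = re.search(r"\+(\d{1,3})", text_country).group(1)
country = CALLING_CODES.get(code, "unknown")

# Names: capitalized first name followed by an upper-case surname.
names = re.findall(r"([A-Z][a-z]+) ([A-Z]{2,})\b", text_name)

# Ages: the number right before "years old".
ages = [int(a) for a in re.findall(r"(\d+) years old", text_age)]

print(planes, phone, masters, country, names, ages)
```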

# NLTK

Exercise 7.1

Open the Covid19 tweets dataset and then :
   - Clean the user description column
   - Clean the text column
   - Do a word cloud for each of the columns
   - Try to find links between the user description and the posted text
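
The cleaning steps could start from a function like the one below. It is a sketch using only the standard library: the stop-word set is a tiny hypothetical stand-in for the list NLTK provides (`nltk.corpus.stopwords`), and dropping hashtags and mentions entirely is a design choice that may not suit every analysis.

```python
import re

# Tiny hypothetical stop-word list; NLTK's English list would normally replace it.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for"}


def clean_text(text):
    """Lower-case a tweet, strip URLs, mentions, hashtags and non-letters, then tokenize."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # digits, punctuation, emoji
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = clean_text("Stay SAFE! The #COVID19 cases are rising https://t.co/abc @WHO")
print(tokens)
```

The same function can be applied to both the user description and the text columns, and the resulting token lists joined back into strings for the word clouds.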