# Project brief
### Step 1: Choose a sector that interests you.
### Step 2: Find a problem statement and identify a target.
### Step 3: Select one or more data sources (SQL, CSV, API, scraping, etc.).
### Step 4: Carry out the data analysis.
### Step 5: Build a machine learning model (linear regression).
### Deliverable: Oral presentation of a clean, lightweight, well-structured notebook (legend and title on every plot, labeled x and y axes; split the notebook into sections).

Optional: Project architecture in OOP, RandomizedSearchCV, GridSearchCV, learning curve.

- Tools to use:
- Analysis: Notebook, NumPy, Pandas, Matplotlib or Seaborn; Sklearn or Statsmodels.
- Sklearn: RandomizedSearchCV, GridSearchCV, cross-validation, train/test split, linear regression model, Pipeline.
- Agile project management: GitHub, Trello (or another project management tool: Jira, ClickUp, Teams, etc.).

In [186]:
import seaborn as sns
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# Load the DataFrame, features and target

In [187]:

df = sns.load_dataset("tips")
X = df.drop(columns=['tip'])
y = df['tip']

# Select the numeric columns ('number' matches every integer and float width on any platform)
numeric_features = X.select_dtypes(include='number').columns

# Select the categorical columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns

In [188]:
X.dtypes

total_bill     float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

# Define the regression models, scoring metrics and imputation strategies

In [189]:
models = [LinearRegression(), Ridge(), Lasso(), RandomForestRegressor()]

scorings = ['r2', 'neg_mean_absolute_error', 'neg_mean_squared_error']

strategies = ['mean', 'median', 'most_frequent', 'constant']

# Create a preprocessor for each imputation strategy

In [190]:
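preprocessings = {}
for strategie in strategies:
    # Note: ColumnTransformer runs its transformers side by side, not in sequence.
    # The imputer and the scaler each receive the raw numeric columns, so the output
    # holds them twice: once imputed but unscaled, once scaled but not imputed.
    preprocessing = ColumnTransformer(
        [
        ('imputer', SimpleImputer(strategy=strategie), numeric_features),
        ('scaler', StandardScaler(), numeric_features),
        ('onehot', OneHotEncoder(), categorical_features)
        ]
    )
    preprocessings[strategie] = preprocessing
    print(f'Strategie: {strategie}')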


Strategie: mean
Strategie: median
Strategie: most_frequent
Strategie: constant
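The preprocessing cell above lists the imputer and the scaler as two separate `ColumnTransformer` entries. A minimal sketch of the alternative, chaining both steps in a nested `Pipeline`, is shown below. It assumes a small hand-made DataFrame (`X_toy`, hypothetical values with one missing entry) instead of the tips data, and the names `numeric_steps`, `parallel` and `chained` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the tips data (hypothetical values), with one missing entry
X_toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, np.nan, 40.0],
    "size": [2, 3, 2, 4],
    "day": ["Sat", "Sun", "Sat", "Sun"],
})
numeric_cols = ["total_bill", "size"]
categorical_cols = ["day"]

# Parallel layout: imputer and scaler each receive the raw numeric columns
parallel = ColumnTransformer([
    ("imputer", SimpleImputer(strategy="mean"), numeric_cols),
    ("scaler", StandardScaler(), numeric_cols),
    ("onehot", OneHotEncoder(), categorical_cols),
])

# Chained layout: impute first, then scale the imputed values
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
chained = ColumnTransformer([
    ("numeric", numeric_steps, numeric_cols),
    ("onehot", OneHotEncoder(), categorical_cols),
])

parallel_out = parallel.fit_transform(X_toy)
chained_out = chained.fit_transform(X_toy)

print(parallel_out.shape)  # numeric columns appear twice
print(chained_out.shape)   # numeric columns appear once
print(np.isnan(parallel_out).any(), np.isnan(chained_out).any())
```

In the parallel layout the scaled copy of the numeric columns still carries the missing value, because `StandardScaler` passes NaN through. The tips dataset has no missing values, so the notebook runs either way, and the imputation strategy cannot change the scores there.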


# Score every combination of strategy, scoring metric, model and number of CV folds

In [191]:
def get_pipelines_dataframe(X, y, models, scorings, preprocessings, cvs):
    df_ret = pd.DataFrame(columns=['strategie', 'model', 'cv', 'scoring', 'mean', 'std', 'pipeline'])
    for strategie in preprocessings:
        for scoring in scorings:
            for model in models:
                for cv in cvs:
                    pipeline = Pipeline([
                        ('preprocessing', preprocessings[strategie]),
                        ('model', model)
                    ])

                    # Evaluate the pipeline with cv-fold cross-validation (cv is 5 or 10 in the call below)
                    scores = cross_val_score(pipeline, X, y, scoring=scoring, cv=cv)
                    df_ret.loc[len(df_ret)] = [strategie, model, cv, scoring, scores.mean(), scores.std(), pipeline]
    return df_ret
df_tests = get_pipelines_dataframe(X,y,models,scorings,preprocessings, [5,10])
df_tests

| | strategie | model | cv | scoring | mean | std | pipeline |
|---|---|---|---|---|---|---|---|
| 0 | mean | LinearRegression() | 5 | r2 | 0.421438 | 0.135329 | (ColumnTransformer(transformers=[('imputer', S... |
| 1 | mean | LinearRegression() | 10 | r2 | 0.322851 | 0.424865 | (ColumnTransformer(transformers=[('imputer', S... |
| 2 | mean | Ridge() | 5 | r2 | 0.422349 | 0.134981 | (ColumnTransformer(transformers=[('imputer', S... |
| 3 | mean | Ridge() | 10 | r2 | 0.325872 | 0.425150 | (ColumnTransformer(transformers=[('imputer', S... |
| 4 | mean | Lasso() | 5 | r2 | 0.449595 | 0.111509 | (ColumnTransformer(transformers=[('imputer', S... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 91 | constant | Ridge() | 10 | neg_mean_squared_error | -1.149587 | 0.655351 | (ColumnTransformer(transformers=[('imputer', S... |
| 92 | constant | Lasso() | 5 | neg_mean_squared_error | -1.069100 | 0.415423 | (ColumnTransformer(transformers=[('imputer', S... |
| 93 | constant | Lasso() | 10 | neg_mean_squared_error | -1.102236 | 0.589624 | (ColumnTransformer(transformers=[('imputer', S... |
| 94 | constant | RandomForestRegressor() | 5 | neg_mean_squared_error | -1.206624 | 0.435431 | (ColumnTransformer(transformers=[('imputer', S... |
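The loop above runs a full cross-validation once per scoring metric. One possible alternative is `cross_validate`, which accepts several scorers and evaluates them on the same folds in a single pass. The sketch below assumes synthetic regression data (`X_demo`, `y_demo`, hypothetical values) and a plain `LinearRegression`, not the notebook's pipelines.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data (hypothetical), not the tips dataset
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, 60),
    "size": rng.integers(1, 6, 60),
})
y_demo = 0.15 * X_demo["total_bill"] + 0.2 * X_demo["size"] + rng.normal(0, 0.5, 60)

scorings = ["r2", "neg_mean_absolute_error", "neg_mean_squared_error"]

# One call fits each fold once and computes all three metrics on it
results = cross_validate(LinearRegression(), X_demo, y_demo, scoring=scorings, cv=5)

# Each metric is returned under the key "test_<scorer name>"
summary = pd.DataFrame({
    name: {
        "mean": results[f"test_{name}"].mean(),
        "std": results[f"test_{name}"].std(),
    }
    for name in scorings
}).T
print(summary)
```

With this layout each (strategy, model, cv) combination yields one row per metric from a single set of fits, which would cut the number of model fits in the loop by a factor of three.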


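# Sort the results by mean score

Note that the `mean` column mixes metrics on different scales: the negated error metrics are never positive, so any positive r2 score outranks them and the top rows all come from r2.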

In [198]:
df_tests.nlargest(20, 'mean')

| | strategie | model | cv | scoring | mean | std | pipeline |
|---|---|---|---|---|---|---|---|
| 4 | mean | Lasso() | 5 | r2 | 0.449595 | 0.111509 | (ColumnTransformer(transformers=[('imputer', S... |
| 28 | median | Lasso() | 5 | r2 | 0.449595 | 0.111509 | (ColumnTransformer(transformers=[('imputer', S... |
| 52 | most_frequent | Lasso() | 5 | r2 | 0.449595 | 0.111509 | (ColumnTransformer(transformers=[('imputer',\n... |
| 76 | constant | Lasso() | 5 | r2 | 0.449595 | 0.111509 | (ColumnTransformer(transformers=[('imputer', S... |
| 2 | mean | Ridge() | 5 | r2 | 0.422349 | 0.134981 | (ColumnTransformer(transformers=[('imputer', S... |
| 26 | median | Ridge() | 5 | r2 | 0.422349 | 0.134981 | (ColumnTransformer(transformers=[('imputer', S... |
| 50 | most_frequent | Ridge() | 5 | r2 | 0.422349 | 0.134981 | (ColumnTransformer(transformers=[('imputer',\n... |
| 74 | constant | Ridge() | 5 | r2 | 0.422349 | 0.134981 | (ColumnTransformer(transformers=[('imputer', S... |
| 0 | mean | LinearRegression() | 5 | r2 | 0.421438 | 0.135329 | (ColumnTransformer(transformers=[('imputer', S... |
| 24 | median | LinearRegression() | 5 | r2 | 0.421438 | 0.135329 | (ColumnTransformer(transformers=[('imputer', S... |
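Since r2 and the negated error metrics live on different scales, a single `nlargest` over the `mean` column only ever surfaces r2 rows. A minimal sketch of ranking within each metric instead is shown below. It uses a hypothetical `df_demo` built from a few rows copied from the result tables above, without the `std` and `pipeline` columns.

```python
import pandas as pd

# A few rows copied from the result tables (std and pipeline columns left out)
rows = [
    ("mean", "Lasso()", 5, "r2", 0.449595),
    ("mean", "Ridge()", 5, "r2", 0.422349),
    ("mean", "LinearRegression()", 5, "r2", 0.421438),
    ("constant", "Ridge()", 10, "neg_mean_squared_error", -1.149587),
    ("constant", "Lasso()", 5, "neg_mean_squared_error", -1.069100),
    ("constant", "Lasso()", 10, "neg_mean_squared_error", -1.102236),
]
df_demo = pd.DataFrame(rows, columns=["strategie", "model", "cv", "scoring", "mean"])

# For every scoring metric, keep the row with the highest mean score
best_idx = df_demo.groupby("scoring")["mean"].idxmax()
best_per_scoring = df_demo.loc[best_idx]
print(best_per_scoring)
```

Applied to the full `df_tests`, the same `groupby` would return one best row per scoring metric, which makes it easier to see whether the metrics agree on a model.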


In [193]:
best_pipeline = df_tests.nlargest(1, 'mean')['pipeline'].iloc[0]
best_pipeline

# Pickle the best Pipeline

In [210]:
import pickle

# cross_val_score only fits clones, so fit the selected pipeline before saving it
best_pipeline.fit(X, y)
with open('pipeline.pkl', 'wb') as f:
    pickle.dump(best_pipeline, f)

In [213]:
with open('pipeline.pkl', 'rb') as f:
    pkl = pickle.load(f)

# predict expects a 2D input with the same columns as X.
# Passing a scalar such as 'k' raises:
#   ValueError: Expected 2D array, got scalar array instead
pkl.predict(X.head())
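The brief lists a train/test split and GridSearchCV among the tools, but the notebook does not show them. The sketch below is one way they could fit together, assuming synthetic data (`X_demo`, `y_demo`, hypothetical values), a `Ridge` model and an illustrative grid of `alpha` values. None of these choices come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical), not the tips dataset
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, 120),
    "size": rng.integers(1, 7, 120),
})
y_demo = 0.15 * X_demo["total_bill"] + 0.2 * X_demo["size"] + rng.normal(0, 0.5, 120)

# Hold out 20% of the rows; the search never sees them
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# Step parameters are addressed as "<step name>__<parameter>"
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, scoring="r2", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
test_r2 = search.score(X_test, y_test)  # r2 of the refitted best pipeline on held-out rows
print(test_r2)
```

Keeping a held-out test set gives an estimate that is independent of the cross-validation used to pick the hyperparameters. `search.best_estimator_` is already refitted on the training rows and could be pickled directly.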