<center><h1>Kickstarter Project</h1></center>

## _Predicting the success of a Kickstarter crowdfunding campaign_
- Part 1: Exploratory data analysis
- Part 2: Data processing
- Part 3: Dataset enrichment: web scraping
- Part 4: Statistical analysis and data visualization
- **Part 5: Machine Learning**

<hr/>


## Part 5: Machine Learning

- From `sklearn`, import the modules `svm`, `linear_model`, `preprocessing`, `model_selection`, `neighbors` and `ensemble`
- From `sklearn.tree`, import `DecisionTreeClassifier`
- Load the dataset `'dataset_kickstarter_full.csv'`

In [1]:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import model_selection
from sklearn import neighbors
from sklearn.tree import DecisionTreeClassifier
from sklearn import ensemble

import warnings
warnings.filterwarnings("ignore")

data = pd.read_csv('dataset_kickstarter_full.csv', index_col=0)

### Data preparation

In [2]:
target = data['success']
features = data[['annee',
                 'proj_name', 
                 'proj_desc_len', 
                 'country', 'goal', 
                 'duree_projet', 
                 'coup_de_coeur', 
                 'cat_id', 
                 'cat_prim', 
                 'cat_name', 
                 'crea_id', 
                 'crea_name']]

# Length of each project name; non-string values (e.g. NaN) count as 0
longueur = []

for desc in features['proj_name']:
    if not isinstance(desc, str):
        longueur.append(0)
    else:
        longueur.append(len(desc))

features['proj_name_len'] = longueur

# One-hot encode the main category, the sub-category and the country
features = features.join(pd.get_dummies(features.cat_prim,
                                        prefix='cat'),
                         how='right')
features = features.join(pd.get_dummies(features.cat_name,
                                        prefix='sous_cat'),
                         how='right')
features = features.join(pd.get_dummies(features.country,
                                        prefix='country'),
                         how='right')

# Drop the raw text columns now that they have been encoded
suppr_cols = ['proj_name', 'country', 'cat_prim', 'cat_name', 'crea_name']

features = features.drop(suppr_cols, axis=1)

> We now split the dataset into a training set and a test set (test set = 20% of the data).

In [3]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(features,
                                                                    target,
                                                                    test_size=0.2)
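
A minimal sketch of a reproducible, stratified variant of this split. `test_size=0.2` reserves 20% of the rows for the test set. The `stratify` argument and the `random_state` value are assumptions, not settings used in this notebook, and the toy arrays stand in for `features` and `target`.

```python
import numpy as np
from sklearn import model_selection

# Toy stand-ins for `features` and `target` (hypothetical data)
rng = np.random.default_rng(0)
features_demo = rng.normal(size=(1000, 3))
target_demo = np.array([1] * 560 + [0] * 440)

# stratify keeps the class ratio identical in both parts;
# random_state makes the split repeatable from one run to the next
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    features_demo, target_demo,
    test_size=0.2, stratify=target_demo, random_state=42)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Fixing the seed would make the scores reported in the following cells comparable across runs.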

### Logistic regression model

In [4]:
clf = linear_model.LogisticRegression(C=1.0)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = pd.crosstab(y_test, y_pred, rownames=['Classe réelle'], colnames=['Classe prédite'])

print('Score = ', clf.score(X_test, y_test))

cm

Score =  0.5611982340673329


| Classe réelle (actual) \ Classe prédite (predicted) | 1 |
|---|---|
| 0 | 16201 |
| 1 | 20720 |
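
The confusion matrix has a single predicted column: the model assigns class 1 to every test project, so the score is simply the share of class 1 in the test set, 20720 / (16201 + 20720) ≈ 0.5612. One possible cause, which is an assumption and was not verified on this dataset, is that the features live on very different scales (amounts and identifiers next to 0/1 dummies). The sketch below standardizes the features with `preprocessing.StandardScaler` inside a pipeline; the data is synthetic and the `max_iter` value is arbitrary.

```python
import numpy as np
from sklearn import linear_model, preprocessing
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 2000

# Synthetic features on very different scales (hypothetical stand-ins)
big = rng.normal(1e6, 5e5, size=n)     # a large amount, in the spirit of `goal`
flag = rng.integers(0, 2, size=n)      # a 0/1 dummy column
X_demo = np.column_stack([big, flag])
noise = rng.normal(0, 0.5, size=n)
y_demo = ((big - 1e6) / 5e5 + 2 * flag + noise > 1).astype(int)

# The scaler is fitted on the training rows only, then applied to the test rows
scaled_clf = make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression(C=1.0, max_iter=1000),
)
scaled_clf.fit(X_demo[:1500], y_demo[:1500])

score = scaled_clf.score(X_demo[1500:], y_demo[1500:])
predicted_classes = np.unique(scaled_clf.predict(X_demo[1500:]))
print(score, predicted_classes)
```

Checking `np.unique` on the predictions is a quick way to spot a model that only ever outputs one class.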


> - Improving the model

In [5]:
parametres = {'C': [0.1, 1.0,2,10],
              'penalty':['l1', 'l2', 'elasticnet']          
             }
grid_clf = model_selection.GridSearchCV(estimator = clf,
                                        param_grid=parametres)

grille = grid_clf.fit(X_train, y_train)
print(grid_clf.best_params_)
print('Score = ', grid_clf.score(X_test, y_test))

y_pred = grid_clf.predict(X_test)
cm = pd.crosstab(y_test, y_pred, rownames=['Classe réelle'], colnames=['Classe prédite'])
cm

{'C': 0.1, 'penalty': 'l2'}
Score =  0.5611982340673329


| Classe réelle (actual) \ Classe prédite (predicted) | 1 |
|---|---|
| 0 | 16201 |
| 1 | 20720 |
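
With the default `lbfgs` solver, `LogisticRegression` supports only the `l2` penalty (or none), so the `l1` and `elasticnet` candidates of this grid fail to fit and receive a NaN score. Because warnings are silenced at the top of the notebook, this goes unnoticed; it is consistent with `l2` being selected and the score staying unchanged. The sketch below uses the `liblinear` solver, which handles both `l1` and `l2`, and `error_score='raise'` so that failed fits stop the search. The data is synthetic, and `elasticnet` is left out because it would need `solver='saga'` together with an `l1_ratio` value.

```python
import numpy as np
from sklearn import linear_model, model_selection

# Synthetic binary problem (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
noise = rng.normal(0, 0.5, size=300)
y_demo = (X_demo[:, 0] - X_demo[:, 1] + noise > 0).astype(int)

param_grid = {'C': [0.1, 1.0, 2, 10],
              'penalty': ['l1', 'l2']}

grid_demo = model_selection.GridSearchCV(
    estimator=linear_model.LogisticRegression(solver='liblinear'),
    param_grid=param_grid,
    error_score='raise',   # fail loudly instead of scoring NaN
)
grid_demo.fit(X_demo, y_demo)

mean_scores = grid_demo.cv_results_['mean_test_score']
n_candidates = len(grid_demo.cv_results_['params'])
print(grid_demo.best_params_, n_candidates)
```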


### SVM model

In [6]:
# Create an SVM classifier
clf = svm.SVC(kernel='poly', # Kernel to use
              gamma=0.01) # Kernel coefficient

# Train the model
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

cm = pd.crosstab(y_test, y_pred, rownames=['Classe réelle'], colnames=['Classe prédite'])

print('Score = ', clf.score(X_test, y_test))

cm

Score =  0.5611982340673329


| Classe réelle (actual) \ Classe prédite (predicted) | 1 |
|---|---|
| 0 | 16201 |
| 1 | 20720 |


### Decision trees

In [7]:
# Create the classifier
dt_clf = DecisionTreeClassifier(criterion = 'entropy', 
                                max_depth=6,  # maximum depth of the tree
                                random_state=123)

# Train the model
dt_clf.fit(X_train, y_train)

y_pred = dt_clf.predict(X_test)

cm = pd.crosstab(y_test, y_pred, rownames=['Classe réelle'], colnames=['Classe prédite'])

print('Score = ', dt_clf.score(X_test, y_test))

cm

Score =  0.7530944448958588


| Classe réelle (actual) \ Classe prédite (predicted) | 0 | 1 |
|---|---|---|
| 0 | 12095 | 4106 |
| 1 | 5010 | 15710 |
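
The score printed by `dt_clf.score` is the accuracy, and it can be recovered from the confusion matrix along with the precision and recall for class 1. A short sketch using the counts of the entropy tree above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the entropy tree (rows: actual class, columns: predicted class)
cm_demo = np.array([[12095, 4106],
                    [5010, 15710]])

tn, fp = cm_demo[0]   # actual 0: correctly rejected, wrongly predicted as success
fn, tp = cm_demo[1]   # actual 1: missed successes, correctly predicted successes

accuracy = (tn + tp) / cm_demo.sum()   # share of correct predictions
precision = tp / (tp + fp)             # predicted successes that really succeeded
recall = tp / (tp + fn)                # real successes that were detected

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```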


In [8]:
feats = {}
for feature, importance in zip(features.columns, dt_clf.feature_importances_):
    feats[feature] = importance 
    
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Importance'})
importances.sort_values(by='Importance', ascending = False ).head(8)

| Feature | Importance |
|---|---|
| cat_id | 0.456128 |
| coup_de_coeur | 0.185644 |
| goal | 0.156858 |
| sous_cat_Playing Cards | 0.070621 |
| cat_Fashion | 0.032284 |
| cat_Comics | 0.031341 |
| sous_cat_Gadgets | 0.020872 |
| annee | 0.020387 |
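
The same ranking can be built in one step by indexing `feature_importances_` with the column names in a pandas `Series`. A minimal sketch on a small hypothetical dataset; the column names are borrowed from the notebook, the values are invented.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the outcome depends only on whether `goal` is small
X_demo = pd.DataFrame({
    'goal': [100, 5000, 200, 8000, 150, 9000, 300, 7000],
    'duree_projet': [30, 30, 45, 45, 30, 60, 30, 60],
    'coup_de_coeur': [1, 0, 1, 0, 1, 0, 0, 1],
})
y_demo = [1, 0, 1, 0, 1, 0, 1, 0]

tree_demo = DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=123)
tree_demo.fit(X_demo, y_demo)

# Attach the column names to the importances, then keep the largest values
importances_demo = pd.Series(tree_demo.feature_importances_,
                             index=X_demo.columns,
                             name='Importance')
top = importances_demo.nlargest(2)
print(top)
```

Taking the names from the DataFrame the model was fitted on keeps names and values aligned even if columns are later reordered.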


#### criterion = 'gini'

In [12]:
# Create the classifier
dt_clf = DecisionTreeClassifier(criterion = 'gini', 
                                max_depth=6,  # maximum depth of the tree
                                random_state=123)

# Train the model
dt_clf.fit(X_train, y_train)

y_pred = dt_clf.predict(X_test)

cm = pd.crosstab(y_test, y_pred, rownames=['Classe réelle'], colnames=['Classe prédite'])

print('Score = ', dt_clf.score(X_test, y_test))

cm

Score =  0.7605698653882614


| Classe réelle (actual) \ Classe prédite (predicted) | 0 | 1 |
|---|---|---|
| 0 | 11010 | 5191 |
| 1 | 3649 | 17071 |


In [13]:
feats = {}
for feature, importance in zip(features.columns, dt_clf.feature_importances_):
    feats[feature] = importance 
    
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Importance'})
importances.sort_values(by='Importance', ascending = False ).head(10)

| Feature | Importance |
|---|---|
| cat_id | 0.43437 |
| coup_de_coeur | 0.18046 |
| goal | 0.140758 |
| sous_cat_Playing Cards | 0.070796 |
| annee | 0.033572 |
| sous_cat_Hip-Hop | 0.028076 |
| cat_Art | 0.027892 |
| sous_cat_Software | 0.025456 |
| sous_cat_Gadgets | 0.0199 |
| duree_projet | 0.011614 |


### Random forests

In [14]:
# Create the classifier
rf_clf = ensemble.RandomForestClassifier(n_jobs = -1, random_state = 321)

# Train the model
rf_clf.fit(X_train, y_train)

y_pred = rf_clf.predict(X_test)

cm = pd.crosstab(y_test, y_pred, rownames=['Classe réelle'], colnames=['Classe prédite'])

print('Score = ', rf_clf.score(X_test, y_test))

cm

Score =  0.8260610492673546


| Classe réelle (actual) \ Classe prédite (predicted) | 0 | 1 |
|---|---|---|
| 0 | 13669 | 2532 |
| 1 | 3890 | 16830 |


In [15]:
feats = {}
for feature, importance in zip(features.columns, rf_clf.feature_importances_):
    feats[feature] = importance 
    
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Importance'})
importances.sort_values(by='Importance', ascending = False ).head(10)

| Feature | Importance |
|---|---|
| goal | 0.125031 |
| cat_id | 0.121634 |
| crea_id | 0.087198 |
| annee | 0.08359 |
| proj_name_len | 0.081325 |
| proj_desc_len | 0.077024 |
| duree_projet | 0.073205 |
| coup_de_coeur | 0.062149 |
| sous_cat_others | 0.011961 |
| sous_cat_Playing Cards | 0.010437 |


## Sentiment analysis

### Method 1

In [61]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier

# Separate the explanatory variable from the variable to predict
X, y = data.proj_name, data.success

# Split the dataset into training data and test data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

In [62]:
# Initialize a vectorizer object
vectorizer = CountVectorizer()

# Replace X_train and X_test with their vectorized versions
X_train = vectorizer.fit_transform(X_train).todense()
X_test = vectorizer.transform(X_test).todense()

MemoryError: Unable to allocate 91.1 GiB for an array with shape (147682, 82834) and data type int64
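
The error comes from `.todense()`: a dense `int64` array of shape (147682, 82834) needs 147682 × 82834 × 8 bytes ≈ 91.1 GiB. `CountVectorizer` returns a SciPy sparse matrix that stores only the non-zero counts, and many scikit-learn estimators, `GradientBoostingClassifier` included, accept sparse input directly. A minimal sketch on a few invented project names; the data, the `n_estimators` value and the variable names are assumptions.

```python
from scipy import sparse
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical project names and outcomes, for illustration only
names_train = ["wooden board game", "indie rock album", "card game deluxe",
               "smart gadget watch", "playing cards deck", "mobile app idea",
               "board game expansion", "new phone gadget"]
y_train_demo = [1, 0, 1, 0, 1, 0, 1, 0]
names_test = ["cooperative board game", "gadget for phones"]

vectorizer_demo = CountVectorizer()

# Keep the matrices sparse, exactly as the vectorizer returns them
X_train_sp = vectorizer_demo.fit_transform(names_train)
X_test_sp = vectorizer_demo.transform(names_test)

gb_demo = GradientBoostingClassifier(n_estimators=20, random_state=0)
gb_demo.fit(X_train_sp, y_train_demo)
pred_demo = gb_demo.predict(X_test_sp)

print(sparse.issparse(X_train_sp), X_train_sp.shape, pred_demo)
```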

### Method 2

In [63]:
import re

def traitement(text):
    # Clean a project name before encoding it

    # Lowercase and collapse line breaks and repeated spaces
    text = text.lower()
    text = text.replace('\n', ' ').replace('\r', '')
    text = ' '.join(text.split())
    # Remove tokens containing digits (numbers, units, percentages)
    text = re.sub(r"[A-Za-z\.]*[0-9]+[A-Za-z%°\.]*", "", text)
    # Remove isolated or trailing hyphens, then common punctuation
    text = re.sub(r"(\s\-\s|-$)", "", text)
    text = re.sub(r"[,\!\?\%\(\)\/\"]", "", text)
    # Remove HTML-like entities starting with '&', then remaining symbols
    text = re.sub(r"\&\S*\s", "", text)
    text = re.sub(r"\&", "", text)
    text = re.sub(r"\+", "", text)
    text = re.sub(r"\#", "", text)
    text = re.sub(r"\$", "", text)
    text = re.sub(r"\?", "", text)
    text = re.sub(r"\£", "", text)
    text = re.sub(r"\%", "", text)
    text = re.sub(r"\:", "", text)
    text = re.sub(r"\@", "", text)
    text = re.sub(r"\-", "", text)
    text = re.sub(r"乡音", "", text)

    return text




# Separate the explanatory variable from the variable to predict
X, y = data.proj_name, data.success

X = X.apply(traitement)
# Split the dataset into training data and test data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)



In [64]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

le = LabelEncoder()
le.fit(X_train)
X_train = le.transform(X_train)
#X_test = le.transform(X_test)
X_train.shape

(147682,)

In [58]:
X_le_mat = X_train.reshape((X_train.shape[0], 1))

In [65]:
ohe = OneHotEncoder(categories="auto")
ohe.fit(X_le_mat)

OneHotEncoder()

In [66]:
X_le_encoded = ohe.transform(X_le_mat)
train_cat = X_le_encoded.todense()
test_cat = ohe.transform(le.transform(X_test).reshape((len(X_test), 1))).todense()

MemoryError: Unable to allocate 161. GiB for an array with shape (147682, 146255) and data type float64
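
The shape in the error message shows 146255 distinct labels for 147682 training names: almost every cleaned name is its own category, so the dense one-hot matrix needs 147682 × 146255 × 8 bytes ≈ 161 GiB and its columns would rarely match a new project. In addition, `LabelEncoder.transform` raises a `ValueError` on labels it has not seen during `fit`, which any new name in `X_test` would trigger. The sketch below illustrates both points on invented names; `handle_unknown="ignore"` is an assumption, not a setting used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical names: the second test name never appears in training
titles_train = np.array(["board game", "rock album", "card deck"])
titles_test = np.array(["board game", "brand new title"])

# LabelEncoder refuses labels it has not seen during fit
le_demo = LabelEncoder().fit(titles_train)
try:
    le_demo.transform(titles_test)
    unseen_error = False
except ValueError:
    unseen_error = True

# OneHotEncoder can encode the strings directly and ignore unseen values;
# its output stays sparse unless it is converted explicitly
ohe_demo = OneHotEncoder(handle_unknown="ignore")
ohe_demo.fit(titles_train.reshape(-1, 1))
test_encoded = ohe_demo.transform(titles_test.reshape(-1, 1))

# An unseen name is encoded as an all-zero row, i.e. it carries no information
row_sums = np.asarray(test_encoded.sum(axis=1)).ravel()
print(unseen_error, test_encoded.shape, row_sums)
```

Word-level counts, as in Method 1 but kept sparse, share vocabulary across names and are therefore more likely to generalize than one column per full name.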