# scikit-learn reference

API reference:  
https://scikit-learn.org/stable/modules/classes.html

## 1. Data preprocessing

### 1.1. Feature scaling

sklearn.preprocessing.**MinMaxScaler**  
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

sklearn.preprocessing.**StandardScaler**  
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
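
As a minimal sketch of how the two scalers differ, assuming a small made-up feature matrix (the values are illustrative and not taken from any dataset): `MinMaxScaler` rescales each column to the range [0, 1], while `StandardScaler` shifts each column to zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix (illustrative values): two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Each column is mapped linearly so that its minimum is 0 and its maximum is 1.
X_minmax = MinMaxScaler().fit_transform(X)

# Each column is centered on its mean and divided by its standard deviation.
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```

In practice the scaler is usually fitted on the training split only and then applied to the test split with `transform`, so that no test information leaks into the preprocessing.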

### 1.2. Text preprocessing

sklearn.feature_extraction.text.**CountVectorizer**  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

**snowballstemmer** (non-scikit-learn)  
https://pypi.org/project/snowballstemmer/
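
A short sketch of what `CountVectorizer` produces, assuming two made-up sentences as the corpus: it builds an alphabetically ordered vocabulary and returns a sparse matrix of per-document token counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, used only for illustration.
docs = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each token to its column index; columns follow sorted token order.
vocab = sorted(vectorizer.vocabulary_)

print(vocab)
print(counts.toarray())
```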

## 2. Sample split

sklearn.model_selection.**train_test_split**  
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

sklearn.model_selection.**KFold**  
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
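
The sketch below contrasts a single hold-out split with k-fold splitting, assuming a toy array of ten samples; the array contents and the `random_state` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy data: 10 samples with 2 features each, and alternating binary labels.
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

# Single hold-out split: 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# 5-fold split: every sample lands in the test fold exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]

print(X_train.shape, X_test.shape)
print(fold_sizes)
```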

## 3. Performance evaluation metrics

### 3.1. Classification performance metrics

sklearn.metrics.**confusion_matrix**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

sklearn.metrics.**accuracy_score**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

sklearn.metrics.**precision_score**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

sklearn.metrics.**recall_score**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

sklearn.metrics.**f1_score**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
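
To see how these metrics relate to each other, here is a small sketch on hand-made binary labels (the label vectors are invented for illustration). With 3 true positives, 1 false positive, 1 false negative and 1 true negative, precision, recall and F1 all come out to 0.75.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up ground truth and predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / all samples
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall

print(cm)
print(acc, prec, rec, f1)
```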

### 3.2. Regression performance metrics

sklearn.metrics.**mean_absolute_error**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html

sklearn.metrics.**mean_squared_error**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
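
A minimal sketch of both regression metrics on four invented target values: the absolute errors are 0.5, 0.5, 0 and 1, so the mean absolute error is 0.5 and the mean squared error is 0.375.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions for illustration.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # mean of |y_true - y_pred|
mse = mean_squared_error(y_true, y_pred)   # mean of (y_true - y_pred) ** 2

print(mae, mse)
```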

### 3.3. Encapsulated cross-validation

sklearn.model_selection.**cross_val_score**  
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
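
`cross_val_score` wraps the split, fit and score loop in a single call. The sketch below assumes the bundled iris dataset and a `LogisticRegression` classifier as stand-ins; any estimator and dataset could take their place.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Bundled toy dataset, used here only as a convenient example.
X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)

# Fits and scores the classifier on 5 different train/test folds.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
mean_score = scores.mean()

print(scores)
print(mean_score)
```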

## 4. Predictive models

sklearn.cluster.**KMeans**  
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

sklearn.linear_model.**Perceptron**  
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html

sklearn.linear_model.**LinearRegression**  
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

sklearn.linear_model.**LogisticRegression**  
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

sklearn.neural_network.**MLPClassifier**  
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
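
All of the estimators listed above share the same `fit` / `predict` interface. As a rough sketch on invented toy data, the example below fits a `LinearRegression` to points on the line y = 2x + 1 and a `KMeans` model to two well-separated groups of points.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised example: points that lie exactly on y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0
reg = LinearRegression().fit(X, y)
pred = reg.predict(np.array([[4.0]]))  # extrapolate to x = 4

# Unsupervised example: two clearly separated groups of 2D points.
points = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(points)
labels = km.labels_  # cluster index assigned to each point

print(pred)
print(labels)
```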

## 5. Bonus task! Text classification

Naive Bayes methods work well for text classification. See the theory basics and API reference:  
https://scikit-learn.org/stable/modules/naive_bayes.html  

Now, download the Amazon Review dataset:  
https://gist.github.com/kunalj101/ad1d9c58d338e20d09ff26bcc06c4235

Make a train-test split and train a sklearn.naive_bayes.**GaussianNB** classifier to predict whether customer reviews are positive or negative. Then evaluate its performance!  
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

*Improvement*: when you're done, try applying word stemming to enhance the model's performance! 
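
Before working with the full dataset, the whole task can be sketched on a handful of hypothetical reviews (the texts and labels below are invented placeholders, not taken from the Amazon data). Note that `CountVectorizer` returns a sparse matrix while `GaussianNB` expects a dense array, hence the `.toarray()` call.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy reviews standing in for the real dataset (1 = positive, 0 = negative).
reviews = [
    "great product works perfectly",
    "absolutely love it great value",
    "excellent quality very happy",
    "wonderful sound and great price",
    "terrible product broke quickly",
    "awful quality very disappointed",
    "waste of money do not buy",
    "poor sound and bad service",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words counts, converted to a dense array for GaussianNB.
X = CountVectorizer().fit_transform(reviews).toarray()

# Stratify so that both classes appear in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=1, stratify=labels
)

clf = GaussianNB().fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

With so few samples the accuracy value itself is not meaningful; the sketch only shows how the pieces fit together.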

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from nltk.stem import PorterStemmer
from sklearn.metrics import accuracy_score

# Preprocess

In [2]:
vectorizer = CountVectorizer()
tk = vectorizer.build_tokenizer()
stemmer = PorterStemmer()

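def preprocess(x):
    # Each raw line looks like "__label__2 <review text>": split the label from the text.
    label, message = x[0].split(' ', 1)
    return [label, message]

def make_stemms(message):
    # Tokenize with CountVectorizer's default tokenizer, stem each token, then re-join.
    tokens = tk(message)
    stems = [stemmer.stem(word) for word in tokens]
    return ' '.join(stems)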

# Globals

In [3]:
rnd_state = 1

# Read and preprocess without stemmer

In [4]:
df = pd.read_csv("corpus.csv",sep='\n',header=None).apply(preprocess,axis=1,result_type='expand')

In [5]:
df.columns = ['label','msg']
df.head(1)

| | label | msg |
|---|---|---|
| 0 | __label__2 | Stuning even for the non-gamer: This sound tra... |


In [6]:
df['label']=preprocessing.LabelEncoder().fit_transform(df['label'])
df.head(1)

| | label | msg |
|---|---|---|
| 0 | 1 | Stuning even for the non-gamer: This sound tra... |


## Bag of words

In [7]:
data = df['msg']
y = df['label'].values

In [8]:
vectorizer = CountVectorizer()
transformer = TfidfTransformer(smooth_idf=False)
X = transformer.fit_transform(vectorizer.fit_transform(data)).toarray()

In [9]:
_, t11 = X.shape  # t11: vocabulary size (number of feature columns)

## Train

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = rnd_state)

In [11]:
gnb = GaussianNB()

In [12]:
gnb.fit(X_train,y_train);

In [13]:
t12 = gnb.score(X_train, y_train)  # t12: accuracy on the training set

## Train prediction

In [14]:
y_train_pred = gnb.predict(X_train)

In [15]:
accuracy_score(y_train, y_train_pred)

0.971125

## Prediction on test set

In [16]:
t13 = gnb.score(X_test, y_test)  # t13: accuracy on the test set

In [17]:
y_test_pred = gnb.predict(X_test)

In [18]:
accuracy_score(y_test, y_test_pred)

0.6515

---

# Read and preprocess with stemmer

## Stemming

In [19]:
df['msg']= [make_stemms(msg) for msg in df['msg']]

## Bag of words

In [20]:
data = df['msg']
y = df['label'].values

In [21]:
vectorizer = CountVectorizer()
transformer = TfidfTransformer(smooth_idf=False)
X = transformer.fit_transform(vectorizer.fit_transform(data)).toarray()

In [22]:
(_,t21)=X.shape

## Train

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = rnd_state)

In [24]:
gnb = GaussianNB()

In [25]:
gnb.fit(X_train,y_train);

## Train prediction

In [26]:
y_train_pred = gnb.predict(X_train)

In [27]:
t22 = accuracy_score(y_train, y_train_pred)

## Test prediction

In [28]:
y_test_pred = gnb.predict(X_test)

In [29]:
t23=accuracy_score(y_test, y_test_pred)

# With stemmer and stop words

In [30]:
df = pd.read_csv("corpus.csv",sep='\n',header=None).apply(preprocess,axis=1,result_type='expand')
df.columns = ['label','msg']
df['label']=preprocessing.LabelEncoder().fit_transform(df['label'])
df['msg']= [make_stemms(msg) for msg in df['msg']]
data = df['msg']
y = df['label'].values
vectorizer = CountVectorizer(stop_words = 'english')
transformer = TfidfTransformer(smooth_idf=False)
X = transformer.fit_transform(vectorizer.fit_transform(data)).toarray()

In [31]:
(_,t31)=X.shape

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = rnd_state)

In [33]:
gnb = GaussianNB()
gnb.fit(X_train,y_train);

In [34]:
y_train_pred = gnb.predict(X_train)
t32=accuracy_score(y_train, y_train_pred)

In [35]:
y_test_pred = gnb.predict(X_test)
t33=accuracy_score(y_test, y_test_pred)

# With stop words

In [40]:
df = pd.read_csv("corpus.csv",sep='\n',header=None).apply(preprocess,axis=1,result_type='expand')
df.columns = ['label','msg']
df['label']=preprocessing.LabelEncoder().fit_transform(df['label'])
data = df['msg']
y = df['label'].values
vectorizer = CountVectorizer(stop_words = 'english')
transformer = TfidfTransformer(smooth_idf=False)
X = transformer.fit_transform(vectorizer.fit_transform(data)).toarray()

In [41]:
(_,t41)=X.shape

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = rnd_state)

In [43]:
gnb = GaussianNB()
gnb.fit(X_train,y_train);

In [44]:
y_train_pred = gnb.predict(X_train)
t42=accuracy_score(y_train, y_train_pred)

In [45]:
y_test_pred = gnb.predict(X_test)
t43=accuracy_score(y_test, y_test_pred)

# Summary

In [46]:
d = {'NoStemm':[t11,t12,t13],'Stemm':[t21,t22,t23],'Stemm+StopWords':[t31,t32,t33],'StopWords':[t41,t42,t43]}
rows = ['Number of words','Accuracy on Train','Accuracy on Test']

In [47]:
df_report = pd.DataFrame(data = d,index=rows)

In [48]:
df_report

| | NoStemm | Stemm | Stemm+StopWords | StopWords |
|---|---|---|---|---|
| Number of words | 31627.0 | 22236.0 | 22020.0 | 31325.0 |
| Accuracy on Train | 0.971125 | 0.92425 | 0.92625 | 0.972875 |
| Accuracy on Test | 0.6515 | 0.647 | 0.645 | 0.651 |
