# scikit-learn reference

API reference:  
https://scikit-learn.org/stable/modules/classes.html

## 1. Data preprocessing

### 1.1. Feature scaling

sklearn.preprocessing.**MinMaxScaler**  
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

sklearn.preprocessing.**StandardScaler**  
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
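
A minimal sketch of how the two scalers differ, using a made-up 3x2 feature matrix (the data and variable names are assumptions for illustration only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# MinMaxScaler maps each column to the [0, 1] range by default.
X_minmax = MinMaxScaler().fit_transform(X)

# StandardScaler shifts each column to zero mean and scales it to unit variance.
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

In a real workflow the scaler would normally be fitted on the training split only and then applied to the test split with `transform`.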

### 1.2. Text preprocessing

sklearn.feature_extraction.text.**CountVectorizer**  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

**snowballstemmer** (non-scikit-learn)  
https://pypi.org/project/snowballstemmer/
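
A small sketch of what `CountVectorizer` produces with its default settings, on two made-up sentences (the sentences are assumptions, not part of any dataset used here):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index (columns are sorted alphabetically).
vocab = vectorizer.vocabulary_

print(sorted(vocab))
print(counts.toarray())
```

Each column holds the number of times one vocabulary token occurs in a document, so the second row records `the` twice.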

## 2. Sample split

sklearn.model_selection.**train_test_split**  
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

sklearn.model_selection.**KFold**  
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
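
A short sketch contrasting a single hold-out split with K-fold splitting, on a made-up array of 10 samples (sizes and seeds are arbitrary assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# Single hold-out split: 80% train, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold split: every sample lands in the test fold exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]

print(X_train.shape, X_test.shape)
print(fold_sizes)
```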

## 3. Performance evaluation metrics

### 3.1. Classification performance metrics

sklearn.metrics.**confusion_matrix**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

sklearn.metrics.**accuracy_score**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

sklearn.metrics.**precision_score**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

sklearn.metrics.**recall_score**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

sklearn.metrics.**f1_score**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
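
A worked example of the five metrics on six made-up binary predictions (the label vectors are assumptions chosen so the numbers are easy to check by hand):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / all samples
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(cm)
print(acc, prec, rec, f1)
```

Here TP = 3, TN = 1, FP = 1 and FN = 1, so accuracy is 4/6 while precision, recall and F1 are all 0.75.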

### 3.2. Regression performance metrics

sklearn.metrics.**mean_absolute_error**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html

sklearn.metrics.**mean_squared_error**  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
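
A quick check of both regression metrics on four made-up values (the numbers are assumptions for illustration):

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Absolute errors: 0.5, 0.5, 0.0, 1.0
mae = mean_absolute_error(y_true, y_pred)

# Squared errors: 0.25, 0.25, 0.0, 1.0
mse = mean_squared_error(y_true, y_pred)

print(mae, mse)
```

Because errors are squared before averaging, MSE penalizes the single large miss more heavily than MAE does.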

### 3.3. Encapsulated cross-validation

sklearn.model_selection.**cross_val_score**  
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
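
A minimal sketch of `cross_val_score`, which wraps the split, fit and score loop into one call. The synthetic dataset and the choice of `LogisticRegression` are assumptions made only to keep the example self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit and score the model on 5 folds; one accuracy value is returned per fold.
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")

print(scores)
print(scores.mean())
```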

## 4. Predictive models

sklearn.cluster.**KMeans**  
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

sklearn.linear_model.**Perceptron**  
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html

sklearn.linear_model.**LinearRegression**  
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

sklearn.linear_model.**LogisticRegression**  
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

sklearn.neural_network.**MLPClassifier**  
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
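
The estimators listed above share the same `fit` / `predict` interface. A small sketch with one supervised and one unsupervised model, on made-up data (all values are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised: recover y = 2x + 1 from four noise-free points.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1
reg = LinearRegression().fit(X, y)
pred = reg.predict(np.array([[4.0]]))

# Unsupervised: two well-separated groups of points, no labels passed to fit.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_

print(pred)
print(labels)
```

The cluster ids assigned by `KMeans` are arbitrary, so only the grouping is meaningful, not which group is called 0 or 1.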

## 5. Bonus task! Text classification

Naive Bayes methods are pretty good at text classification! See the theory basics and API reference:  
https://scikit-learn.org/stable/modules/naive_bayes.html  

Now, download the Amazon Review dataset:  
https://gist.github.com/kunalj101/ad1d9c58d338e20d09ff26bcc06c4235

Make a train-test split and train a sklearn.naive_bayes.**GaussianNB** classifier to predict whether customer reviews are positive or negative. Then evaluate its performance!  
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

*Improvement*: when you're done, try applying word stemming to enhance the model's performance! 
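
One possible way to wire stemming into the vectorization step is a custom `analyzer`. The sketch below uses `naive_stem`, a hypothetical toy suffix stripper written only for this example. It is not a real stemming algorithm, and a proper stemmer (for instance `snowballstemmer.stemmer('english').stemWord`) could be substituted for it.

```python
from sklearn.feature_extraction.text import CountVectorizer


def naive_stem(word):
    """Hypothetical stand-in stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Reuse CountVectorizer's default lowercasing and tokenization,
# then stem every token it produces.
base_analyzer = CountVectorizer().build_analyzer()


def stemmed_analyzer(doc):
    return [naive_stem(token) for token in base_analyzer(doc)]


vectorizer = CountVectorizer(analyzer=stemmed_analyzer)
X_stemmed = vectorizer.fit_transform(["played playing plays", "play"])
stemmed_vocab = sorted(vectorizer.vocabulary_)

print(stemmed_vocab)
print(X_stemmed.toarray())
```

All four word forms collapse into a single `play` feature, which is how stemming shrinks the vocabulary.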

In [1]:
import string
import os
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

In [2]:
# Show up to 160 characters per column so review text is readable in the output.
pd.set_option('display.max_colwidth', 160)

In [3]:
DATA_DIR = 'data'
FILE_NAME = 'corpus.csv'

file_path = os.path.join(DATA_DIR, FILE_NAME)

In [4]:
df = pd.read_csv(file_path, sep='\228__228', names=['text'])

In [5]:
df.head()

Unnamed: 0,text
0,__label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who ...
1,__label__2 The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a...
2,"__label__2 Amazing!: This soundtrack is my favorite music of all time, hands down. The intense sadness of ""Prisoners of Fate"" (which means all the more if y..."
3,__label__2 Excellent Soundtrack: I truly like this soundtrack and I enjoy video game music. I have played this game and most of the music on here I enjoy an...
4,"__label__2 Remember, Pull Your Jaw Off The Floor After Hearing it: If you've played the game, you know how divine the music is! Every single song tells a st..."


In [6]:
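# Each raw line looks like "__label__2 <review text>": the first token is the
# label and the rest is the review itself.
df['label'] = df['text'].apply(lambda text: text.split()[0])
df['text'] = df['text'].apply(lambda text: ' '.join(text.split()[1:]))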

In [7]:
df.head(10)

Unnamed: 0,text,label
0,Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. g...,__label__2
1,The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to ...,__label__2
2,"Amazing!: This soundtrack is my favorite music of all time, hands down. The intense sadness of ""Prisoners of Fate"" (which means all the more if you've playe...",__label__2
3,Excellent Soundtrack: I truly like this soundtrack and I enjoy video game music. I have played this game and most of the music on here I enjoy and it's trul...,__label__2
4,"Remember, Pull Your Jaw Off The Floor After Hearing it: If you've played the game, you know how divine the music is! Every single song tells a story of the ...",__label__2
5,"an absolute masterpiece: I am quite sure any of you actually taking the time to read this have played the game at least once, and heard at least a few of th...",__label__2
6,"Buyer beware: This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's...",__label__1
7,Glorious story: I loved Whisper of the wicked saints. The story was amazing and I was pleasantly surprised at the changes in the book. I am not normaly some...,__label__2
8,"A FIVE STAR BOOK: I just finished reading Whisper of the Wicked saints. I fell in love with the caracters. I expected an average romance read, but instead I...",__label__2
9,"Whispers of the Wicked Saints: This was a easy to read book that made me want to keep reading on and on, not easy to put down.It left me wanting to read the...",__label__2


In [8]:
df.shape

(10000, 2)

In [9]:
df['label'].unique()

array(['__label__2', '__label__1'], dtype=object)

In [10]:
labels_dict = {
    '__label__1': 0,
    '__label__2': 1,
}

In [11]:
df.replace(to_replace={'label': labels_dict}, inplace=True)

In [12]:
df.head(10)

Unnamed: 0,text,label
0,Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. g...,1
1,The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to ...,1
2,"Amazing!: This soundtrack is my favorite music of all time, hands down. The intense sadness of ""Prisoners of Fate"" (which means all the more if you've playe...",1
3,Excellent Soundtrack: I truly like this soundtrack and I enjoy video game music. I have played this game and most of the music on here I enjoy and it's trul...,1
4,"Remember, Pull Your Jaw Off The Floor After Hearing it: If you've played the game, you know how divine the music is! Every single song tells a story of the ...",1
5,"an absolute masterpiece: I am quite sure any of you actually taking the time to read this have played the game at least once, and heard at least a few of th...",1
6,"Buyer beware: This is a self-published book, and if you want to know why--read a few paragraphs! Those 5 star reviews must have been written by Ms. Haddon's...",0
7,Glorious story: I loved Whisper of the wicked saints. The story was amazing and I was pleasantly surprised at the changes in the book. I am not normaly some...,1
8,"A FIVE STAR BOOK: I just finished reading Whisper of the Wicked saints. I fell in love with the caracters. I expected an average romance read, but instead I...",1
9,"Whispers of the Wicked Saints: This was a easy to read book that made me want to keep reading on and on, not easy to put down.It left me wanting to read the...",1


In [13]:
df['label'].value_counts()

0    5097
1    4903
Name: label, dtype: int64

In [14]:
def remove_punctuation(text):
    # Drop every character listed in string.punctuation.
    exclude = set(string.punctuation)
    return ''.join(c for c in text if c not in exclude)


def preprocess_text(text):
    # Lowercase first, then strip punctuation.
    text = text.lower()
    text = remove_punctuation(text)
    return text

In [15]:
df['text'] = df['text'].apply(preprocess_text)

In [16]:
df.head(10)

Unnamed: 0,text,label
0,stuning even for the nongamer this sound track was beautiful it paints the senery in your mind so well i would recomend it even to people who hate vid game ...,1
1,the best soundtrack ever to anything im reading a lot of reviews saying that this is the best game soundtrack and i figured that id write a review to disagr...,1
2,amazing this soundtrack is my favorite music of all time hands down the intense sadness of prisoners of fate which means all the more if youve played the ga...,1
3,excellent soundtrack i truly like this soundtrack and i enjoy video game music i have played this game and most of the music on here i enjoy and its truly r...,1
4,remember pull your jaw off the floor after hearing it if youve played the game you know how divine the music is every single song tells a story of the game ...,1
5,an absolute masterpiece i am quite sure any of you actually taking the time to read this have played the game at least once and heard at least a few of the ...,1
6,buyer beware this is a selfpublished book and if you want to know whyread a few paragraphs those 5 star reviews must have been written by ms haddons family ...,0
7,glorious story i loved whisper of the wicked saints the story was amazing and i was pleasantly surprised at the changes in the book i am not normaly someone...,1
8,a five star book i just finished reading whisper of the wicked saints i fell in love with the caracters i expected an average romance read but instead i fou...,1
9,whispers of the wicked saints this was a easy to read book that made me want to keep reading on and on not easy to put downit left me wanting to read the fo...,1


In [17]:
X = df['text'].values
y = df['label'].values  # 1-D label vector, the shape scikit-learn expects for targets

In [18]:
count_vectorizer = CountVectorizer()

In [19]:
X = count_vectorizer.fit_transform(X)

In [20]:
X.shape

(10000, 40157)

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

In [22]:
X_train.shape

(9000, 40157)

In [23]:
gnb = GaussianNB()

In [24]:
gnb.fit(X_train.toarray(), y_train.ravel())

GaussianNB(priors=None, var_smoothing=1e-09)

In [25]:
y_train_pred = gnb.predict(X_train.toarray())
y_test_pred = gnb.predict(X_test.toarray())

In [26]:
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)

In [27]:
print('{:.3f}'.format(train_acc))
print('{:.3f}'.format(test_acc))

0.939
0.670


In [28]:
train_prec = precision_score(y_train, y_train_pred)
test_prec = precision_score(y_test, y_test_pred)

In [29]:
print('{:.3f}'.format(train_prec))
print('{:.3f}'.format(test_prec))

0.997
0.729
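
A possible follow-up experiment, not part of the task above: the Naive Bayes page linked earlier also describes `MultinomialNB`, which models word counts directly and accepts the sparse matrix from `CountVectorizer` without `.toarray()`. The sketch below swaps it in on a tiny made-up corpus (the texts and labels are assumptions, and nothing here says how it would score on the review data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: 1 = positive, 0 = negative.
texts = [
    "great product love it",
    "excellent quality great value",
    "terrible waste of money",
    "awful quality very bad",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X_counts = vec.fit_transform(texts)  # stays sparse, no dense conversion needed

mnb = MultinomialNB().fit(X_counts, labels)

# Words unseen during fitting (such as "and") are ignored by transform.
pred = mnb.predict(vec.transform(["great value", "terrible and bad"]))
print(pred)
```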
