# **Objective of the notebook:**



1.   Give an introduction to text classification using `TfidfVectorizer`.
2.   Understand how to use `Pipeline` and why it is efficient.
3.   Understand that an ML (Machine Learning) model treats all data the same way, and that creating a model follows a specific set of steps.
4.   Understand the importance of data cleaning and data analysis in creating an effective model.

### **Key areas to improve the result:**

1.   Use a classifier other than `LinearSVC`, such as K-Nearest Neighbors, Decision Trees, Random Forest, Logistic Regression, or Naive Bayes.
2.   Fine-tune the hyperparameters to achieve better predictions.
3.   Use a different vectorizer, such as `CountVectorizer`.
4.   Remove stopwords (it is unclear whether this would help much).
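
The ideas in the list above can be tried together with a grid search. The following is only a minimal sketch, not a tuned setup: the toy corpus, the step names `vect` and `clf`, and the candidate values for `C` are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, used only to keep the sketch self-contained.
toy_reviews = [
    "great sound and amazing music",
    "loved this wonderful soundtrack",
    "excellent album truly amazing",
    "wonderful music great value",
    "terrible sound and boring music",
    "hated this awful soundtrack",
    "boring album truly terrible",
    "awful music poor value",
]
toy_labels = ["pos"] * 4 + ["neg"] * 4

# Step names "vect" and "clf" are arbitrary; they are what the grid keys refer to.
pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", LinearSVC())])

# Each dict is searched separately. A whole step can be swapped by name,
# and "step__param" reaches into the parameters of that step.
param_grid = [
    {
        "vect": [TfidfVectorizer(), CountVectorizer()],
        "vect__stop_words": [None, "english"],
        "clf": [LinearSVC()],
        "clf__C": [0.1, 1.0, 10.0],
    },
    {
        "vect": [TfidfVectorizer(), CountVectorizer()],
        "vect__stop_words": [None, "english"],
        "clf": [LogisticRegression(max_iter=1000), MultinomialNB()],
    },
]

# cv=2 only because the toy corpus is tiny; real data would allow more folds.
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(toy_reviews, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On a real dataset the same grid would be fitted on the training split only, so the test split stays untouched for the final evaluation.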

#### **Reference Materials:**


1.   [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

2.   [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

3.   [Text Classification](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a)




In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df1 = pd.read_csv('/content/amazonreviews.tsv',sep='\t')
df1.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [5]:
df1.isnull().sum()

label     0
review    0
dtype: int64

In [6]:
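blanks1 = []
# Collect the index of every review that is empty or whitespace-only.
# Series.items() yields (index, value) pairs.
for index, value in df1['review'].items():
  if isinstance(value, str) and value.strip() == '':
    blanks1.append(index)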

print(blanks1)

[]


In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X1 = df1['review']
y1 = df1['label']
X_train1,X_test1,y_train1,y_test1 = train_test_split(X1,y1,test_size=0.33,random_state=101)

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
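
To make the vectorizer less of a black box, here is a rough pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as I understand them from the scikit-learn documentation: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_rows` and the two example documents are made up for illustration, and this is not the library's actual implementation.

```python
import math
import re
from collections import Counter

# Same default token pattern that appears in the TfidfVectorizer repr:
# words of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in weights])
    return vocab, rows


# Hypothetical two-document corpus.
vocab, rows = tfidf_rows(["good movie", "bad movie"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

The word shared by both documents ("movie") ends up with a lower weight than the words that distinguish them, which is the property the classifier benefits from.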

In [0]:
from sklearn.pipeline import Pipeline

In [0]:
pp1 = Pipeline([('tfidf',TfidfVectorizer()),('linearSVC',LinearSVC())])
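
For intuition about what the `Pipeline` saves, the sketch below fits the same two steps once through a pipeline and once by hand, then compares the predictions. The tiny corpus and the fixed `random_state=0` are assumptions added so the comparison is self-contained and repeatable; they are not part of the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, for illustration only.
train_texts = [
    "amazing soundtrack, truly wonderful",
    "excellent music and great value",
    "loved every track",
    "terrible sound, truly awful",
    "boring music and poor value",
    "hated every track",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
test_texts = ["wonderful music", "awful sound"]

# With a Pipeline: one fit call and one predict call.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("linearSVC", LinearSVC(random_state=0)),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

# By hand: fit_transform on the training text, then transform (not fit) on the test text.
vectorizer = TfidfVectorizer()
classifier = LinearSVC(random_state=0)
X_train_vec = vectorizer.fit_transform(train_texts)
classifier.fit(X_train_vec, train_labels)
manual_pred = classifier.predict(vectorizer.transform(test_texts))

print(list(pipe_pred))
print(list(manual_pred))
```

The pipeline keeps the vectorizer and the classifier together, so the test text is always transformed with the vocabulary learned from the training text.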

In [12]:
pp1.fit(X_train1,y_train1)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('linearSVC',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
         

In [0]:
pred1 = pp1.predict(X_test1)

In [0]:
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix

In [15]:
print(accuracy_score(y_test1,pred1))
print(classification_report(y_test1,pred1))
print(confusion_matrix(y_test1,pred1))

0.8651515151515151
              precision    recall  f1-score   support

         neg       0.86      0.87      0.87      1668
         pos       0.87      0.86      0.86      1632

    accuracy                           0.87      3300
   macro avg       0.87      0.87      0.87      3300
weighted avg       0.87      0.87      0.87      3300

[[1451  217]
 [ 228 1404]]
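
The accuracy and the per-class figures in the report can be re-derived from the confusion matrix alone. The sketch below does that in plain Python, assuming the usual scikit-learn layout (rows are true labels, columns are predicted labels, labels in sorted order). The helper name `metrics_from_confusion` is made up for this example.

```python
def metrics_from_confusion(matrix, labels):
    """Compute accuracy plus per-class precision, recall and F1 from a confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}

    for i, label in enumerate(labels):
        predicted_as_label = sum(row[i] for row in matrix)  # column sum
        actually_label = sum(matrix[i])                     # row sum (the "support")
        precision = matrix[i][i] / predicted_as_label
        recall = matrix[i][i] / actually_label
        f1 = 2 * precision * recall / (precision + recall)
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actually_label,
        }
    return report


# Confusion matrix printed above for the Amazon reviews test split.
cm = [[1451, 217], [228, 1404]]
report = metrics_from_confusion(cm, ["neg", "pos"])

print(report["accuracy"])
for label in ("neg", "pos"):
    print(label, {k: round(v, 2) for k, v in report[label].items()})
```

Reading the matrix this way shows that the errors are nearly balanced: 217 negative reviews were predicted as positive and 228 positive reviews were predicted as negative.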


In [16]:
df2 = pd.read_csv('/content/moviereviews.tsv',sep='\t')
df2.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


In [17]:
df2.isnull().sum()

label      0
review    35
dtype: int64

In [0]:
df2.dropna(inplace=True)

In [20]:
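blanks2 = []
# Collect the index of every review that is empty or whitespace-only.
# Series.items() yields (index, value) pairs.
for index, value in df2['review'].items():
  if isinstance(value, str) and value.strip() == '':
    blanks2.append(index)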

print(blanks2)

[]


In [0]:
X2 = df2['review']
y2 = df2['label']
X_train2,X_test2,y_train2,y_test2 = train_test_split(X2,y2,test_size=0.33,random_state=101)
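
As a quick sanity check on `test_size=0.33`: to my understanding, `train_test_split` rounds the test share up and gives the remainder to training. The sketch below reproduces that arithmetic. The row count of 1965 is an assumption, chosen because it is consistent with the 649 test rows reported in the classification report for this dataset; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# Assumed number of rows left after dropna(); not stated in the notebook output.
n_train, n_test = split_sizes(1965, 0.33)
print(n_train, n_test)
```

Checking the split sizes this way is a cheap test that the cleaning steps removed the number of rows you expected.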

In [0]:
pp2 = Pipeline([('tfidf',TfidfVectorizer()),('linearSVC',LinearSVC())])

In [23]:
pp2.fit(X_train2,y_train2)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('linearSVC',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
         

In [41]:
pred2 = pp2.predict(X_test2)
print(accuracy_score(y_test2,pred2))
print(classification_report(y_test2,pred2))
print(confusion_matrix(y_test2,pred2))

0.8181818181818182
              precision    recall  f1-score   support

         neg       0.80      0.82      0.81       309
         pos       0.83      0.81      0.82       340

    accuracy                           0.82       649
   macro avg       0.82      0.82      0.82       649
weighted avg       0.82      0.82      0.82       649

[[254  55]
 [ 63 277]]


In [25]:
df3 = pd.read_csv('/content/moviereviews2.tsv',sep='\t')
df3.head()

Unnamed: 0,label,review
0,pos,I loved this movie and will watch it again. Or...
1,pos,"A warm, touching movie that has a fantasy-like..."
2,pos,I was not expecting the powerful filmmaking ex...
3,neg,"This so-called ""documentary"" tries to tell tha..."
4,pos,This show has been my escape from reality for ...


In [26]:
df3.isnull().sum()

label      0
review    20
dtype: int64

In [0]:
df3.dropna(inplace=True)

In [28]:
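blanks3 = []
# Collect the index of every review that is empty or whitespace-only.
# Series.items() yields (index, value) pairs.
for index, value in df3['review'].items():
  if isinstance(value, str) and value.strip() == '':
    blanks3.append(index)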

print(blanks3)

[]


In [0]:
X3 = df3['review']
y3 = df3['label']
X_train3,X_test3,y_train3,y_test3 = train_test_split(X3,y3,test_size=0.33,random_state=101)

In [42]:
pp3 = Pipeline([('tfidf',TfidfVectorizer()),('linearSVC',LinearSVC())])
pp3.fit(X_train3,y_train3)
pred3 = pp3.predict(X_test3)
print(accuracy_score(y_test3,pred3))
print(classification_report(y_test3,pred3))
print(confusion_matrix(y_test3,pred3))

0.9224924012158054
              precision    recall  f1-score   support

         neg       0.93      0.92      0.92      1000
         pos       0.92      0.93      0.92       974

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974

[[916  84]
 [ 69 905]]


In [34]:
df4 = pd.read_csv('/content/smsspamcollection.tsv',sep='\t')
df4.head()

Unnamed: 0,label,message,length,punct
0,ham,"Go until jurong point, crazy.. Available only ...",111,9
1,ham,Ok lar... Joking wif u oni...,29,6
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6
3,ham,U dun say so early hor... U c already then say...,49,6
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2


In [35]:
df4.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [37]:
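blanks4 = []
# Collect the index of every message that is empty or whitespace-only.
# Series.items() yields (index, value) pairs.
for index, value in df4['message'].items():
  if isinstance(value, str) and value.strip() == '':
    blanks4.append(index)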

print(blanks4)

[]
