<a href="https://colab.research.google.com/github/Evandro72/04-Text-Classification-WEBTEX/blob/main/04_Text_Classification_WEBTEXT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment - Solution
This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that contain either `NaN` data, or have strings made up of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
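The original exercise loads the dataset from `'../TextFiles/moviereviews2.tsv'`. The cells below instead load `'AllTagsTable.csv'`, which stores the text in a `Content` column and the label in a `Tag` column.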

In [35]:
import numpy as np
import pandas as pd

df = pd.read_csv('AllTagsTable.csv')
df.head()

Unnamed: 0,Index,Content,Tag
0,1,Educação não se faz somente construindo escola...,edu
1,2,Começamos hoje a construção da Escola Municipa...,edu
2,3,Excelência Santa Luzia precisa do seu olhar . ...,edu
3,4,Educação não se faz somente construindo escola...,edu
4,5,"Vamos ter salas climatizadas, quadra poliespor...",edu


### Task #2: Check for missing values:

In [28]:
# Check for NaN values:
df.isnull().sum()

Index      0
Content    0
Tag        0
dtype: int64

In [29]:
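# Cast every text entry to a string so the vectorizer only receives text
df['Content'] = df['Content'].apply(lambda x: np.str_(x))

In [31]:
# Do the same for the labels
df['Tag'] = df['Tag'].apply(lambda x: np.str_(x))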
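
One caveat about casting to strings: a missing value becomes the literal text `nan`, so a later `dropna()` no longer sees it. The missing-value check above reports zero NaNs for this dataset, so nothing is lost here, but the order of operations matters in general. The sketch below uses plain `str` and `math.nan` as stand-ins, on the assumption that `np.str_` behaves the same way for this purpose.

```python
import math

missing = math.nan          # stand-in for a NaN cell in the DataFrame
as_text = str(missing)      # what a string cast produces for that cell

# The cast value is ordinary text, so it is neither missing nor whitespace
print(repr(as_text))
print(as_text.isspace())
```

Running the NaN check and `dropna()` before any string cast avoids this.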

In [32]:
df

Unnamed: 0,Index,Content,Tag
0,1,Educação não se faz somente construindo escola...,edu
1,2,Começamos hoje a construção da Escola Municipa...,edu
2,3,Excelência Santa Luzia precisa do seu olhar . ...,edu
3,4,Educação não se faz somente construindo escola...,edu
4,5,"Vamos ter salas climatizadas, quadra poliespor...",edu
...,...,...,...
1083,1084,Isso vai ser o q? Camarote? Posto da PM!? Logo...,car
1084,1085,Bruno Reis cadê a programação do carnaval de S...,car
1085,1086,"Tá vacilando viu meu prefeito, cadastramento ...",car
1086,1087,Fala logo aí com o posto de ACM Neto para sus...,car


In [33]:

df.head()

Unnamed: 0,Index,Content,Tag
0,1,Educação não se faz somente construindo escola...,edu
1,2,Começamos hoje a construção da Escola Municipa...,edu
2,3,Excelência Santa Luzia precisa do seu olhar . ...,edu
3,4,Educação não se faz somente construindo escola...,edu
4,5,"Vamos ter salas climatizadas, quadra poliespor...",edu


In [36]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []  # start with an empty list

# Iterate over the 'Content' column only: unpacking df.itertuples() into
# three names raises a ValueError because this DataFrame has more columns.
for i, rv in df['Content'].items():
    if isinstance(rv, str):      # skip NaN values
        if rv.isspace():         # test the text for whitespace
            blanks.append(i)     # add matching index labels to the list

len(blanks)
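
The whitespace check can also be written without an explicit loop. The following is a minimal sketch on a hypothetical toy frame (the texts and the `toy` name are made up for illustration); it only assumes the same `Content`/`Tag` column layout as the DataFrame above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the 'Content'/'Tag' layout
toy = pd.DataFrame({
    'Content': ['first comment', '   ', np.nan, 'another comment'],
    'Tag': ['edu', 'edu', 'car', 'car'],
})

# Keep rows whose text is present and is not made up of whitespace only
keep = toy['Content'].notna() & toy['Content'].str.strip().ne('')
clean = toy[keep]

print(clean)
```

The boolean mask handles missing and blank entries in one step, so no separate list of index labels is needed.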

### Task #3:  Remove NaN values:

In [37]:
df.dropna(inplace=True)

### Task #4: Take a quick look at the `Tag` column:

In [39]:
df['Tag'].value_counts()

car    617
edu    282
sau    178
emp     11
Name: Tag, dtype: int64
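
These counts are far from balanced. As a rough illustration, the class shares can be computed directly from the numbers printed above; the largest share is also the accuracy a classifier would reach on the full dataset by always predicting the most frequent tag.

```python
# Counts copied from the value_counts() output above
counts = {'car': 617, 'edu': 282, 'sau': 178, 'emp': 11}

total = sum(counts.values())
shares = {tag: n / total for tag, n in counts.items()}

for tag, share in shares.items():
    print(f"{tag}: {share:.1%}")

# Share of the most frequent tag, i.e. a majority-class baseline
baseline = max(shares.values())
print(f"majority baseline: {baseline:.1%}")
```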

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [40]:
from sklearn.model_selection import train_test_split

X = df['Content']
y = df['Tag']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
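
With a rare class such as `emp`, a purely random split can leave very few examples of it on one side. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps class proportions similar in both sets. The sketch below runs on hypothetical toy labels, not on the real data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data with one rare class
texts = [f"comment {i}" for i in range(40)]
labels = ['car'] * 22 + ['edu'] * 10 + ['sau'] * 6 + ['emp'] * 2

# stratify=labels asks for the same class mix in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```

Using `stratify=y` on the real data would change the split, so the results would no longer match the figures reported below.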

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [41]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)
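
Because the vectorizer is part of the pipeline, a fitted pipeline accepts raw strings directly. The sketch below shows this on a hypothetical miniature corpus (the texts and the `demo_clf` name are invented for illustration); the real notebook would call `text_clf.predict` in the same way.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical miniature corpus standing in for X_train / y_train
train_texts = [
    "new school classrooms for students",
    "teachers and school books",
    "carnival parade schedule",
    "carnival music and parade",
]
train_tags = ["edu", "edu", "car", "car"]

demo_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),   # turns raw text into TF-IDF features
    ("clf", LinearSVC()),           # linear classifier on those features
])
demo_clf.fit(train_texts, train_tags)

# Raw strings go straight in; the pipeline vectorizes them first
pred_edu = demo_clf.predict(["school teachers"])[0]
pred_car = demo_clf.predict(["carnival parade"])[0]
print(pred_edu, pred_car)
```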

### Task #7: Run predictions and analyze the results

In [42]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [43]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[195   1   0   2]
 [  1  84   0   6]
 [  0   1   0   0]
 [  5  15   0  50]]
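
To read this matrix, recall that scikit-learn puts true labels on the rows and predicted labels on the columns, with labels in sorted order (`car`, `edu`, `emp`, `sau`). A small sketch of how per-class precision and recall follow from it; the helper name `per_class_scores` is made up for this illustration.

```python
labels = ["car", "edu", "emp", "sau"]
cm = [
    [195, 1, 0, 2],
    [1, 84, 0, 6],
    [0, 1, 0, 0],
    [5, 15, 0, 50],
]

def per_class_scores(cm, labels):
    """Return {label: (precision, recall)} from a confusion matrix."""
    scores = {}
    for k, name in enumerate(labels):
        tp = cm[k][k]                            # correct predictions
        predicted = sum(row[k] for row in cm)    # column total
        actual = sum(cm[k])                      # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        scores[name] = (precision, recall)
    return scores

scores = per_class_scores(cm, labels)
for name, (p, r) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The `emp` column is all zeros: the model never predicts that tag, so its precision has no defined value and is reported as 0 here.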


In [44]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         car       0.97      0.98      0.98       198
         edu       0.83      0.92      0.87        91
         emp       0.00      0.00      0.00         1
         sau       0.86      0.71      0.78        70

    accuracy                           0.91       360
   macro avg       0.67      0.66      0.66       360
weighted avg       0.91      0.91      0.91       360
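
The gap between the macro and weighted averages comes from how each class is counted. The macro average treats all four tags equally, so the zero score for `emp` pulls it down; the weighted average scales each tag by its support, so the single `emp` test example barely matters. A minimal sketch for precision, recomputed from the confusion matrix printed earlier:

```python
# Confusion matrix from the notebook output (rows: true, columns: predicted)
cm = [
    [195, 1, 0, 2],
    [1, 84, 0, 6],
    [0, 1, 0, 0],
    [5, 15, 0, 50],
]

support = [sum(row) for row in cm]                     # true examples per tag
col_totals = [sum(row[k] for row in cm) for k in range(len(cm))]
precision = [
    cm[k][k] / col_totals[k] if col_totals[k] else 0.0
    for k in range(len(cm))
]

# Macro: plain mean over tags. Weighted: mean weighted by support.
macro = sum(precision) / len(precision)
weighted = sum(p * s for p, s in zip(precision, support)) / sum(support)

print(f"macro avg precision:    {macro:.2f}")
print(f"weighted avg precision: {weighted:.2f}")
```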





In [45]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.9138888888888889


## Great job!