# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and it uses a very similar dataset.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [2]:
# Perform imports and load the dataset:
import numpy as np
import pandas as pd

# Path to a local copy of the file; the raw string keeps the Windows backslashes literal
file_path = r'C:\Users\iamke\OneDrive\Desktop\NLP\moviereviews2.tsv'

# Load the dataset
df = pd.read_csv(file_path, sep='\t')

# Display the first few rows of the dataframe
print(df.head())


  label                                             review
0   pos  I loved this movie and will watch it again. Or...
1   pos  A warm, touching movie that has a fantasy-like...
2   pos  I was not expecting the powerful filmmaking ex...
3   neg  This so-called "documentary" tries to tell tha...
4   pos  This show has been my escape from reality for ...


### Task #2: Check for missing values:

In [3]:
df.isnull().sum()

label      0
review    20
dtype: int64

In [4]:
# Check for whitespace strings (it's OK if there aren't any!):
whitespace = (df == ' ').sum()
whitespace

label     0
review    0
dtype: int64
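Note that comparing against `' '` only matches reviews that are exactly one space long. A review made up of several spaces or a tab would slip through. The sketch below shows a broader check on a small made-up DataFrame; `toy`, `is_blank`, and `blank_mask` are hypothetical names used only for illustration, not part of the assessment.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not taken from moviereviews2.tsv)
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film.', ' ', '   \t', np.nan],
})

# The equality test only finds strings that are exactly one space
single_space = (toy['review'] == ' ').sum()

def is_blank(value):
    """Return True for strings that are empty or contain only whitespace."""
    return isinstance(value, str) and value.strip() == ''

# Boolean mask of whitespace-only reviews; NaN entries are left for dropna()
blank_mask = toy['review'].apply(is_blank)

# Drop whitespace-only rows first, then rows with missing values
cleaned = toy[~blank_mask].dropna()

print("Exactly one space:", single_space)
print("Whitespace-only:", blank_mask.sum())
print("Rows kept:", len(cleaned))
```

On the real DataFrame the same mask could be built from `df['review']`. Here the `NaN` count above already accounts for all 20 problem rows, so no whitespace-only reviews are expected.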

### Task #3: Remove NaN values:

In [5]:
df.dropna(inplace=True)

### Task #4: Take a quick look at the `label` column:

In [6]:
df['label'].value_counts()

label
pos    2990
neg    2990
Name: count, dtype: int64

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [7]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
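As a quick sanity check, it helps to know how many rows the split should produce, since later steps depend on `y_test` and the predictions having the same length. The sketch below repeats the split on placeholder arrays of the same length as the cleaned DataFrame (5980 rows); `X_demo` and `y_demo` are hypothetical stand-ins, not the real reviews.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 5980  # 2990 pos + 2990 neg left after dropna()

# Placeholder data with the same length as the cleaned DataFrame
X_demo = np.arange(n_rows)
y_demo = np.array(['pos', 'neg'] * (n_rows // 2))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# With a float test_size, the test set size is rounded up
expected_test = math.ceil(0.33 * n_rows)

print("Train rows:", len(X_tr))
print("Test rows:", len(X_te), "(expected:", expected_test, ")")
```

The test set size here matches the 1974 predictions reported further down, so `y_test` should also have 1974 entries when it comes straight from this split.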

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [34]:
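from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Vectorize the raw text with TF-IDF, then classify with a linear SVM
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])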

text_clf.fit(X_train, y_train)



### Task #7: Run predictions and analyze the results

In [32]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [38]:
from sklearn import metrics

print("Shape of y_test:", y_test.shape)
print("Shape of predictions:", predictions.shape)


print("y_test:", y_test)
print("predictions:", predictions)
print("Confusion Matrix:")



Shape of y_test: (4,)
Shape of predictions: (1974,)
y_test: [1 0 1 0]
predictions: [0 0 0 ... 0 0 0]
Confusion Matrix:


In [46]:
print("Shape of y_test:", y_test.shape)
print("Shape of predictions:", predictions.shape)


print("y_test:", y_test)
print("predictions:", predictions)


try:
    accuracy = metrics.accuracy_score(y_test, predictions)
    print("Overall Accuracy:", accuracy)
except Exception as e:
    print("Error occurred while computing accuracy:", e)


Shape of y_test: (4,)
Shape of predictions: (1974,)
y_test: [1 0 1 0]
predictions: [0 0 0 ... 0 0 0]
Error occurred while computing accuracy: Found input variables with inconsistent numbers of samples: [4, 1974]
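The error says that `y_test` has 4 entries while `predictions` has 1974. The predictions have the length expected from the split, so the likely cause (an assumption, since the cell is not shown) is that `y_test` was reassigned to a short array such as `[1, 0, 1, 0]` somewhere between the split and the evaluation. Re-running the split cell restores it. The sketch below runs the whole flow in one piece on a few made-up reviews, so every name is defined exactly once; the `reviews` and `labels` lists are hypothetical and stand in for `df['review']` and `df['label']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up reviews for illustration only
reviews = [
    "I loved this movie and would watch it again",
    "A warm and touching story with great acting",
    "Wonderful film with a powerful ending",
    "Brilliant performances and a great script",
    "An excellent and moving picture",
    "Great fun from start to finish",
    "A dull and boring waste of time",
    "Terrible acting and a weak plot",
    "I hated this awful movie",
    "Boring script and terrible pacing",
    "A weak and forgettable mess",
    "Awful film with a dull story",
]
labels = ['pos'] * 6 + ['neg'] * 6

# Split once and do not reassign y_test afterwards
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.33, random_state=42
)

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())])
text_clf.fit(X_train, y_train)
predictions = text_clf.predict(X_test)

# Guard against the mismatch seen above before computing any metric
assert len(y_test) == len(predictions), "y_test and predictions differ in length"

cm = confusion_matrix(y_test, predictions, labels=['neg', 'pos'])
accuracy = accuracy_score(y_test, predictions)

print("Confusion Matrix:")
print(cm)
print(classification_report(y_test, predictions, zero_division=0))
print("Overall Accuracy:", accuracy)
```

With the real data, the same three metric calls apply once `y_test` again holds the 1974 `pos`/`neg` labels from the split.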
