<hr>

# Kickoffs - IMDB Movie Review Analysis 🍿

*You are given a dataset of reviews posted on IMDB for movies and series. Although the dataset is a CSV file, its records contain fragments of HTML markup. The goal is to analyze this data with NLP techniques and extract useful insights from it.*

##### After completing this challenge, you will be able to:
+ Understand the concepts of data preprocessing.
+ Apply core NLP concepts.
+ Build machine learning models using scikit-learn and NLTK.
+ Apply intensive data preprocessing methods.

<hr>


In [1]:
import pandas as pd

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

from sklearn.metrics import accuracy_score, classification_report
from collections import Counter
import warnings

warnings.filterwarnings('ignore')

<hr>

## Task 1 : Data Loading and Exploration
+ Load the IMDB reviews dataset `imdb.csv` into a pandas DataFrame named `data`.
+ Use the pandas functions `info()`, `head()`, and `describe()` to explore the structure, columns, and initial samples of the `data` dataset.
+ Analyze the distribution and characteristics of the text features in the dataset.

<hr>

In [2]:
data = pd.read_csv('imdb.csv')

In [3]:
# Explore the dataset
print(data.info())
print(data.head())
print(data.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     2000 non-null   object
 1   sentiment  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB
None
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
                                                   review sentiment
count                                                2000      2000
unique                                               2000         2
top     One of the other reviewers has mentioned that ...  positive
freq                                      

In [4]:
text_distribution = data['review'].apply(len)
print(text_distribution.describe())

count    2000.00000
mean     1282.17500
std       945.45511
min        98.00000
25%       697.75000
50%       952.50000
75%      1573.25000
max      8180.00000
Name: review, dtype: float64
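
Character length is only one view of the text feature. The following is a minimal sketch of two further checks: how many records still carry HTML tags, and how many whitespace-separated words each record has. It assumes a small hand-written list, `sample_reviews`, which is hypothetical and merely stands in for `data['review']`.

```python
import re
from statistics import mean

# Hypothetical stand-in for data['review']; these rows are not read from imdb.csv
sample_reviews = [
    "A wonderful little production. <br /><br />The filming technique is very unassuming.",
    "Probably my all-time favorite movie.",
    "Basically there's a family where a little boy thinks there's a zombie in his closet.",
]

# Any '<...>' span is treated as an HTML tag for this rough check
tag_pattern = re.compile(r"<[^>]+>")

# Count the reviews that contain at least one tag
reviews_with_html = sum(1 for review in sample_reviews if tag_pattern.search(review))

# Whitespace-separated word count per review (tags are still counted as words here)
word_counts = [len(review.split()) for review in sample_reviews]

print(f"Reviews containing HTML tags: {reviews_with_html} of {len(sample_reviews)}")
print(f"Word counts: {word_counts}, mean: {mean(word_counts):.1f}")
```

On the real DataFrame, the same checks could likely be expressed with `data['review'].str.contains(...)` and `data['review'].str.split().str.len()`.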


<hr>

## Task 2 : Data Preprocessing
+ Complete the following function `clean_text` to remove HTML tags, punctuation, and stopwords from the `review` column.
+ Using regex, remove the HTML tags and all punctuation in each record.
+ Tokenize the text and remove all stopwords from it. Once the removal is complete, join the tokens back into a sentence and return the cleaned text.
+ Apply the `clean_text` function to the `review` column of the `data` dataset and store the result in a new column called `clean_review`.

***Sample dataset after cleaning***:

review | sentiment | clean_review
------ | ---------- | -----------
A wonderful little production. \<br />\<br />The filming technique is very unassuming- | positive | wonderful little production filming technique unassuming
Probably my all-time favorite movie | positive | Probably alltime favorite movie

**Note:** Do not modify the function name.
<br>
<hr>


In [5]:
### Do not modify function name
def clean_text(text):
    ## Write your function code here
    ## Start of coding block ##
    # Remove HTML tags such as <br />
    cleaned = re.sub(r'<.*?>', '', text)
    # Remove punctuation (anything that is not a word character or whitespace)
    cleaned = re.sub(r'[^\w\s]', '', cleaned)
    # Tokenize and drop English stopwords (case-insensitive match)
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(cleaned)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    # Join the remaining tokens back into a single string
    cleaned = ' '.join(filtered_text)

    ## End of coding block ##
    return cleaned
 
# Apply cleaning function to the 'review' column
data['clean_review'] = data['review'].apply(clean_text)
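
To see what each cleaning step contributes, here is a minimal sketch of the same pipeline without the NLTK dependency. It is not the graded function: `clean_text_sketch` and `SAMPLE_STOP_WORDS` are hypothetical names, the stopword set is a tiny illustrative subset, and tokenization is plain whitespace splitting rather than `word_tokenize`.

```python
import re

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list
SAMPLE_STOP_WORDS = {"a", "the", "is", "very", "my"}

def clean_text_sketch(text):
    """Hypothetical simplified variant of clean_text, for illustration only."""
    no_html = re.sub(r"<.*?>", "", text)        # drop tags such as <br />
    no_punct = re.sub(r"[^\w\s]", "", no_html)  # drop punctuation, keep word characters
    tokens = no_punct.split()                   # whitespace tokenization (assumption)
    kept = [tok for tok in tokens if tok.lower() not in SAMPLE_STOP_WORDS]
    return " ".join(kept)

first = clean_text_sketch(
    "A wonderful little production. <br /><br />The filming technique is very unassuming-"
)
second = clean_text_sketch("Probably my all-time favorite movie")
print(first)
print(second)
```

With these two inputs the sketch reproduces the `clean_review` values in the sample table above. Note that removing a tag without inserting a space can fuse neighbouring words when the tag sits directly between them.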

<hr>

## Task 3 : Sentiment Analysis Model Building
+ Prepare the data and labels; store them in `X` & `y` respectively.
+ Create an instance of TF-IDF Vectorization with max features set to 5000 in variable `tfidf_vectorizer`.
+ Fit and transform the data extracted and store it in `X_tfidf`.
+ Split the transformed data and the labels into training and testing sets named `X_train`, `X_test`, `y_train`, `y_test`, with a test size of 20% and the random state set to 42.
+ Initialize an SVM classifier with a seed value of `42` and store it in the variable `svm`.
+ Fit the classifier on the training data, predict on the testing data, and store the predictions in `y_pred`.
+ Compute the accuracy of the predictions against the testing labels and store it in `accuracy`.
+ Get the classification report for the predictions and the testing labels as a dictionary and store it in `report`.

<hr>

In [7]:
# Prepare data and labels
X = data['clean_review']
y = data['sentiment']
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf_vectorizer.fit_transform(X)
 
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
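
For intuition about what the vectorizer produces, the sketch below computes TF-IDF weights by hand. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The three documents and the helper `tfidf_vector` are hypothetical, and the sketch ignores details such as `max_features` and the vectorizer's own tokenization.

```python
import math
from collections import Counter

# Hypothetical cleaned documents, standing in for data['clean_review']
docs = ["wonderful little production", "wonderful movie", "boring movie"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for doc, tokens in zip(docs, tokenized):
    print(doc, "->", tfidf_vector(tokens))
```

Terms that occur in fewer documents (here `little` and `production`) receive a higher weight than terms shared across documents (`wonderful`), which is the property the classifier relies on.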

In [9]:
# Initialize SVM classifier
svm = LinearSVC(random_state=42)
 
# Train the classifier
svm.fit(X_train, y_train)
 
# Predictions
y_pred = svm.predict(X_test)
 
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
 
# Classification report
report = classification_report(y_test, y_pred, output_dict=True)

Accuracy: 0.835
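
Accuracy alone hides how each class fares. The sketch below shows, on hypothetical labels, how the per-class figures found in a classification report (precision, recall, F1, support) are derived from predictions. The helper `per_class_metrics` and the two label lists are illustrative assumptions, not part of the notebook.

```python
# Hypothetical true and predicted labels, using the dataset's two sentiment classes
y_true_demo = ["positive", "negative", "positive", "negative", "positive"]
y_pred_demo = ["positive", "positive", "positive", "negative", "negative"]

def per_class_metrics(y_true, y_pred, label):
    """Compute precision, recall, F1 and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1-score": f1, "support": tp + fn}

accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
report_demo = {
    label: per_class_metrics(y_true_demo, y_pred_demo, label)
    for label in ("negative", "positive")
}

print(f"Accuracy: {accuracy_demo}")
for label, metrics in report_demo.items():
    print(label, metrics)
```

In the notebook, `report['positive']['precision']` should hold the corresponding value for the real predictions, since `classification_report(..., output_dict=True)` returns a nested dictionary keyed by class label.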


<hr>

## Task 4 : Word Frequency Analysis
+ Get the 10 most common words from the cleaned review data and store them as a dictionary.
+ Name the dictionary `wc_dict`. Example: `{word1: count1, word2: count2}`
+ Use tokenization and counting methods to calculate the word frequencies.

<br>
<hr>

In [10]:
# Analyze word frequency in the IMDb dataset
# Display the top 10 most common words
word_count = Counter(' '.join(data['clean_review']).split())
wc_dict = dict(word_count.most_common(10))
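
The counting step can be checked on a corpus small enough to verify by eye. This is a minimal sketch with a hypothetical list `cleaned_demo` standing in for `data['clean_review']`, and it takes the top 2 words rather than the top 10.

```python
from collections import Counter

# Hypothetical cleaned reviews, standing in for data['clean_review']
cleaned_demo = [
    "wonderful little production",
    "wonderful movie",
    "boring movie plot",
    "movie night",
]

# Join all reviews into one string, split on whitespace, and count each token
word_count_demo = Counter(" ".join(cleaned_demo).split())

# most_common(n) returns (word, count) pairs sorted by descending count
top_two = dict(word_count_demo.most_common(2))
print(top_two)
```

Because Python dictionaries preserve insertion order, the resulting dictionary keeps the words in descending order of frequency.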

<hr>

## Task 5 : Average Word Length Calculation
- Compute the average word length in characters across the text dataset and store the result in `avg_word_length`.
- Tokenize the text and calculate the average length of tokens.

<hr>

In [12]:
# Calculate average word length
tokens = word_tokenize(' '.join(data['clean_review']))
avg_word_length = sum(len(token) for token in tokens) / len(tokens)
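
The same average can be worked through on a tiny example. This sketch assumes a hypothetical list `cleaned_demo` and uses whitespace splitting in place of `word_tokenize`; it also guards against an empty token list, which would otherwise raise a `ZeroDivisionError`.

```python
# Hypothetical cleaned reviews, standing in for data['clean_review']
cleaned_demo = ["wonderful little production", "boring movie"]

# Whitespace tokenization (assumption; the notebook uses word_tokenize)
tokens_demo = " ".join(cleaned_demo).split()

# Total characters across tokens divided by the number of tokens
total_chars = sum(len(token) for token in tokens_demo)
avg_word_length_demo = total_chars / len(tokens_demo) if tokens_demo else 0.0

print(f"{total_chars} characters over {len(tokens_demo)} tokens")
print(f"Average word length: {avg_word_length_demo}")
```

This is an average over all tokens in the corpus, so longer reviews contribute more than shorter ones; averaging per review first would give a different number.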

### <span style="color:red"> ! Note: After you finish solving the problem, please run the cell below to save your answers for testing.</span>

In [13]:
### Do not modify this block
from test_imdb import test_imdb
try:
    test_imdb.save_answer(data, clean_text,X, y, tfidf_vectorizer,X_train, X_test, y_train, y_test,svm, y_pred, accuracy, report, wc_dict, avg_word_length)
except:
    print("Assign the answers to all the variables properly")
    test_imdb.remove_pickle()
    try:
        test_imdb.save_ans1(data, clean_text, X, y)
    except:
        pass
    try:
        test_imdb.save_ans2(tfidf_vectorizer, X_train, X_test, y_train, y_test)
    except:
        pass
    try:
        test_imdb.save_ans3(svm, y_pred)
    except:
        pass
    try:
        test_imdb.save_ans4(accuracy, report)
    except:
        pass
    try:
        test_imdb.save_ans5(wc_dict, avg_word_length)
    except:
        pass
####

Test Case 1 Passed
Test Case 2 Passed
Test Case 3 Passed
Test Case 4 Passed
Test Case 5 Passed
