- Some hints for hacking our challenge:
  - Ask yourself: why was this problem selected for the challenge? What gotchas in this domain should I know about?
  - What is the highest accuracy that others have achieved on this dataset or on similar problems and datasets?
  - What types of visualizations will help me grasp the nature of the problem and the data?
  - What feature engineering might help improve the signal?
  - Which modeling techniques are good at capturing the types of relationships I see in this data?
  - Now that I have a model, how can I be sure that I didn't introduce a bug in the code? If results look too good to be true, they probably are!
  - What are the weaknesses of the model, and how can it be improved with additional work?
- List of challenges for Cohort 25 and 26:
  - Choose a CV or NLP problem. Do a thorough exploratory data analysis of the dataset and report the final performance metrics for your approach. Suggest ways in which you can improve the model.

### Objective

### Current Solution

### Frame the Problem

### Performance Measure

### Data Source
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data

### Constants

In [14]:
DATA_PATH = '../Data/Raw/IMDB Dataset.csv'
PREPROCESSED_PATH = "../Data/Processed/preprocessed_df.pkl"

### Packages

In [60]:
import numpy as np
import pandas as pd

import logging
import pickle
from pathlib import Path

import re
import string
from nltk.corpus import stopwords

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.figure_factory as ff
import plotly.io as pio

from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


# pd.options.display.max_rows = 10000
# pd.options.display.max_columns = 10000

### Functions

### Load Dataset

In [16]:
# Read Dataset and print shape
raw_df = pd.read_csv(DATA_PATH)
raw_df.shape

(50000, 2)

### Data Preprocessing

In [17]:
raw_df

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |


In [18]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


- The dataset has no null values.

In [19]:
# Check for Duplicates
raw_df.duplicated().value_counts()

False    49582
True       418
dtype: int64

- The dataset contains 418 duplicated rows (same review and same sentiment).

In [20]:
# Remove the Duplicates
raw_df = raw_df.drop_duplicates()
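
`duplicated()` with no arguments compares whole rows, so it only flags copies where both the review and the label match. A minimal sketch, on invented toy rows rather than the real data, of how one might also check for the same review text appearing under different labels; `toy_df` and its contents are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-in for raw_df; these rows are invented for illustration.
toy_df = pd.DataFrame(
    {
        "review": ["great film", "great film", "dull plot", "dull plot", "fine"],
        "sentiment": ["positive", "positive", "negative", "positive", "positive"],
    }
)

# Exact duplicates: review and sentiment both match (the default behaviour).
exact_dupes = int(toy_df.duplicated().sum())

# Text-only duplicates: same review text, regardless of label.
text_dupes = int(toy_df.duplicated(subset="review").sum())

# Review texts whose copies carry more than one distinct label.
labels_per_review = toy_df.groupby("review")["sentiment"].nunique()
conflicting = labels_per_review[labels_per_review > 1].index.tolist()

print(exact_dupes, text_dupes, conflicting)
```

If the text-only count on the real data were larger than the exact count, the difference would point at reviews with contradictory labels, which may be worth inspecting before training.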

In [21]:
# Check whether the dataset is balanced or imbalanced?
raw_df['sentiment'].value_counts()

positive    24884
negative    24698
Name: sentiment, dtype: int64

- The dataset is balanced: 24,884 positive vs. 24,698 negative reviews after deduplication.

In [22]:
# Check whether any empty reviews exist (count reviews of length 0)
print((raw_df['review'].str.len() == 0).sum())

0


- The dataset has no empty reviews.

In [30]:
# Keep a copy of the deduplicated, unprocessed data for later comparison
df = raw_df.copy()


In [183]:
# Reset raw_df from the saved copy so the preprocessing cells below can be re-run safely
raw_df = df.copy()

In [184]:
# apply text preprocessing

In [185]:
# Convert text to lowercase
raw_df['review'] = raw_df['review'].str.lower()

In [186]:
# Remove HTML tags (replace each tag with a space so adjacent words stay separated)
raw_df['review'] = raw_df['review'].apply(lambda x: re.sub(r'<[^<]+?>', ' ', x))

In [189]:
# Remove punctuation: replace every punctuation character with a space.
# string.punctuation already contains '-', so hyphens need no separate pass.
raw_df['review'] = raw_df['review'].apply(lambda x: re.sub(f"[{re.escape(string.punctuation)}]", ' ', x))

In [190]:
# Remove Digits
raw_df['review'] = raw_df['review'].apply(lambda x: re.sub(r'\d+', '', x))

In [192]:
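# Remove stop words (requires the NLTK corpus: run nltk.download('stopwords') once)
# A set gives constant-time membership tests; the list is scanned for every word.
stop_words = set(stopwords.words('english'))
raw_df['review'] = raw_df['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))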
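
The preprocessing cells above could also be collected into a single function, which makes the steps easier to test on one string before applying them to the whole column. This is only a sketch: `clean_review` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word set is a stand-in so the example runs without the NLTK corpus.

```python
import re
import string

# Tiny stand-in stop-word set (an assumption); in the notebook one would pass
# set(stopwords.words('english')) instead.
DEMO_STOP_WORDS = frozenset({"a", "the", "is", "was", "in", "of", "and", "it"})

# Same patterns as the cells above, compiled once.
_TAG_RE = re.compile(r"<[^<]+?>")
_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")
_DIGIT_RE = re.compile(r"\d+")


def clean_review(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip HTML tags, punctuation and digits, then drop stop words."""
    text = text.lower()
    text = _TAG_RE.sub(" ", text)    # HTML tags -> space
    text = _PUNCT_RE.sub(" ", text)  # punctuation -> space
    text = _DIGIT_RE.sub("", text)   # digits removed
    return " ".join(w for w in text.split() if w not in stop_words)


cleaned = clean_review(
    "A wonderful little production. <br /><br />Set in 1962, it was well-made!"
)
print(cleaned)
```

On the DataFrame this would be applied as `raw_df['review'].apply(clean_review, stop_words=stop_words)`, since `Series.apply` forwards extra keyword arguments to the function.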

In [202]:
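# Verify the results on a random review (cleaned text first, then the original)
i = df.sample(1).index[0]  # this is an index label, not a position
# i = 100
# Use .loc: after drop_duplicates() the index labels no longer match row positions
print(raw_df['review'].loc[i])
print('###########################################################')
print(df['review'].loc[i])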

beautiful film set hong kong man mr chow woman mrs chan become close friends suspect spouses affair stylistically film also beautiful wong kar wai uses lot slow motion close ups parts body feet hands waist film reticence properness suggests time period sexy without showing everything wong kar wai also allow audience see spouses look like suggesting mr chow mrs chan together smoking even made look elegant close ups curls smoke really lovely film prepare ending
###########################################################
beautiful film set in 1962 hong kong about a man (mr. chow) and woman (mrs. chan) who become close friends when they suspect their spouses are having an affair. stylistically, the film is also beautiful. wong kar-wai uses a lot of slow motion and close-ups on parts of the body (feet, hands, waist). the film itself has a reticence and properness that suggests its time period. it's sexy without showing everything. wong kar-wai also doesn't allow the audience to see what the
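
In the sample above, "doesn't allow the audience to see" became "allow audience see": the negation was removed along with the other stop words, which flips the meaning of the phrase and may matter for sentiment. A hedged sketch of one way to keep negations; `remove_stop_words`, `DEMO_STOP_WORDS`, and `NEGATIONS` are hypothetical names, and both word sets are short stand-ins rather than the NLTK list.

```python
# Shortened stand-in for the stop-word list (an assumption, not the NLTK list).
DEMO_STOP_WORDS = {"the", "to", "a", "is", "doesn", "t", "not", "no"}

# Hypothetical set of negation tokens to protect from removal.
NEGATIONS = {"not", "no", "nor", "never", "doesn", "don", "didn"}


def remove_stop_words(text, stop_words, keep=frozenset()):
    """Drop stop words from whitespace-separated text, except tokens in `keep`."""
    effective = set(stop_words) - set(keep)
    return " ".join(w for w in text.split() if w not in effective)


# Punctuation has already been replaced by spaces, so "doesn't" arrives as "doesn t".
sample = "wong kar wai also doesn t allow the audience to see"

without_negation = remove_stop_words(sample, DEMO_STOP_WORDS)
with_negation = remove_stop_words(sample, DEMO_STOP_WORDS, keep=NEGATIONS)

print(without_negation)
print(with_negation)
```

Whether keeping negations actually helps is an empirical question for the modeling stage; the sketch only shows how the two variants could be produced for comparison.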

### Exploratory Data Analysis
#### Questions we need to answer

In [None]:
raw_df

In [None]:
raw_df.describe()

In [None]:
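# The temporary 'length' column was dropped earlier, so compute the length inline.
# Shows the first review that is exactly 32 characters long (errors if none exists).
raw_df.loc[raw_df['review'].str.len() == 32].iloc[0, 0]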
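
One question the EDA could start with is which tokens are most frequent in each class. A minimal sketch using only the standard library, on invented toy reviews; `top_tokens_by_label` and `toy_reviews` are hypothetical, and in the notebook the rows could come from `raw_df[['review', 'sentiment']].itertuples(index=False)`.

```python
from collections import Counter

# Invented, already-cleaned reviews standing in for rows of raw_df.
toy_reviews = [
    ("beautiful film lovely ending", "positive"),
    ("lovely film great acting", "positive"),
    ("bad plot bad acting", "negative"),
    ("dull film bad dialogue", "negative"),
]


def top_tokens_by_label(rows, n=3):
    """Return the n most common tokens for each label as (token, count) pairs."""
    counters = {}
    for text, label in rows:
        counters.setdefault(label, Counter()).update(text.split())
    return {label: counter.most_common(n) for label, counter in counters.items()}


top = top_tokens_by_label(toy_reviews, n=2)
print(top)
```

Tokens that rank highly in both classes (such as "film" here) carry little signal on their own, which is one input to the feature engineering step.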

### Feature Engineering

### Model Building