# Project 8: Working with Textual Data: Text Classification

In the realm of entertainment, understanding audience sentiment towards movies is paramount for studios and production houses aiming to gauge the reception of their cinematic offerings. The IMDB movie review sentiment classification problem presents a crucial business challenge: accurately classifying movie reviews into positive or negative sentiments. What adds complexity to this task is the inherent variability in review lengths, the diverse vocabulary of words used, and the necessity for the model to discern the intricate long-term dependencies and contextual nuances embedded within the text.

Create a first-step document that lists the output of your exploratory analysis, any issues or problems you see with the data that need follow-up, and some basic descriptive analysis that highlights important outcomes/findings from the data. Based on your findings, the next level of analysis will be charted out. Build a predictive model to classify the reviews into positive or negative. Perform a comparative study of several predictive models with various approaches and give your inferences accordingly.

**Dataset description:**  

The dataset has 50,000 records and 2 columns: `review` and `sentiment`. The `review` column contains IMDB reviews. `sentiment` is the target variable with 2 classes (positive and negative).

**Dataset: IMDB Dataset.csv**

**Software Engineering aspect:**  

Apply software engineering practices while building the model: use modular programming principles to organize your code into reusable functions or classes, which improves readability, maintainability, and collaboration.



**Initial Guidelines:**

1. Follow the user IDs provided by UNext as the file-naming convention.
2. Create a GitHub account and submit the GitHub link.
3. Tasks 1.5 to 2.6 may require the use of a GPU.
4. Learners can request a GPU-based instance on demand by sending an email to `corpsupport@u-next.com`.

### General Instructions

- The cells in the Jupyter notebook can be executed any number of times for testing the solution
- Refrain from modifying the boilerplate code as it may lead to unexpected behavior 
- The solution is to be written between the comments `# code starts here` and `# code ends here`
- On completing all the questions, the assessment is to be submitted on Moodle for evaluation
- Before submitting the assessment, the notebook must execute with `no error`. If any code causes an error, comment it out.
- The kernel of the Jupyter notebook is to be set as `Python 3 (ipykernel)` if not set already
- Include imports as necessary
- For each task, the `Note` section provides hints for solving the problem.
- Do not use a `print` statement inside the `except` block. Use only a `return` statement within the `except` block.

In [None]:
#Required imports


# Task 1: Load the dataset and perform preliminary EDA with key observations and insights (weightage - 20 marks)

#### T1.1: Load the IMDB Dataset using try and except blocks. (weightage - 2 marks) (AE)

#### Note:
- Define a function named `load_the_dataset()` that attempts to load data from a CSV file named "IMDB Dataset.csv".
- Use a `try-except` block inside the function.
- Try to read the CSV file using `pd.read_csv()`.
- If successful, return the dataset (`df`).
- If there's an error (e.g., file not found), return the message "File not found. Please check the file path."
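
The steps above can be sketched as follows. This is a minimal illustration rather than the graded solution: the helper name `load_csv_or_message` and the throwaway demo file are assumptions made so the example runs without the real dataset. In the notebook the path would be "IMDB Dataset.csv".

```python
import os
import tempfile

import pandas as pd


def load_csv_or_message(path):
    """Return a DataFrame, or an error message string if loading fails."""
    try:
        return pd.read_csv(path)
    except Exception:
        # Return (do not print) the message, as the instructions require.
        return "File not found. Please check the file path."


# Demo with a temporary CSV so the sketch does not depend on the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    pd.DataFrame(
        {"review": ["good film", "dull plot"], "sentiment": ["positive", "negative"]}
    ).to_csv(demo_path, index=False)

    loaded = load_csv_or_message(demo_path)
    missing = load_csv_or_message(os.path.join(tmp, "does_not_exist.csv"))

print(type(loaded).__name__, "|", missing)
```

Because the failure branch returns a string, a later `df.head()` would fail on it. Checking `isinstance(result, pd.DataFrame)` before continuing is one way to guard against that.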

In [None]:
def load_the_dataset():
    try:
        
        return 
    except :
        return 

- After defining the function, call it to load the dataset and assign it to the variable `df`.
- Print the first few rows of the dataset using `print(df.head())`.

In [None]:
# store the result of the dataset
df=load_the_dataset()
print(df.head())

#### T1.2: Check the distribution of the target variable (percentage) (weightage - 2 marks) (AE)

#### Note:
- Define a function `target_class(df)` to analyze the distribution of a target variable ('sentiment') within a DataFrame (df).
- Inside the function, calculate the proportion of each unique value in the 'sentiment' column using the `value_counts` method with `normalize=True`.
- Multiply the proportions by 100 to obtain percentages.
- Return the calculated distribution.
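
As a small illustration of the calculation described above, the sketch below runs `value_counts(normalize=True)` on a four-row toy DataFrame. The toy data and variable names are assumptions for the example only.

```python
import pandas as pd

toy = pd.DataFrame({"sentiment": ["positive", "positive", "positive", "negative"]})

# normalize=True yields proportions; multiplying by 100 gives percentages.
distribution = toy["sentiment"].value_counts(normalize=True) * 100
print(distribution)
```

The result is a Series indexed by class label, so an individual percentage can be read with `distribution["positive"]`.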

In [None]:
def target_class(df):
    #code starts here
    
    #code ends
    return 


- Call the function with `df` as argument to get `target_distribution`.
- Print `target_distribution`.

In [None]:
# store the result
target_distribution = target_class(df)
print(target_distribution)

#### T1.3: Clean individual reviews: Remove all punctuation from words. Remove HTML tags. Remove words between square brackets. Remove all words that are not purely alphabetical characters. Convert all words to lowercase. Use an error handling technique. (weightage - 5 marks) (ME)

#### Note:
- Define a function `strip_html(text)` to remove HTML tags using BeautifulSoup.
- Define a function `remove_between_square_brackets(text)` to remove text within square brackets using regex.
- Define a function `denoise_text(text)` to apply `strip_html()` and `remove_between_square_brackets()` to lowercased text. Handle exceptions using `try-except`.
- Do not use a `print` statement inside the `except` block. Use only a `return` statement within the `except` block.
- Define a function `remove_special_characters(text, remove_digits=True)` to remove special characters using regex.
- Define a function `remove_punctuation(text)` to remove punctuation using regex.
- Apply `denoise_text()`, `remove_special_characters()`, and `remove_punctuation()` functions to the 'review' column in DataFrame `df`.
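
The cleaning steps above might look like the following sketch, shown on a single sample string. The regex patterns are one possible choice, not the required ones, and the `try-except` wrapper of `denoise_text` is left out for brevity.

```python
import re

from bs4 import BeautifulSoup


def strip_html(text):
    # get_text() drops the tags and keeps only the visible text.
    return BeautifulSoup(text, "html.parser").get_text()


def remove_between_square_brackets(text):
    # Remove "[...]" spans, brackets included.
    return re.sub(r"\[[^\]]*\]", "", text)


def remove_special_characters(text, remove_digits=True):
    # Keep letters and whitespace; keep digits only if remove_digits is False.
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    return re.sub(pattern, "", text)


sample = "Great <b>movie</b>!<br />[spoiler] Rated 10/10."
denoised = remove_between_square_brackets(strip_html(sample.lower()))
cleaned = remove_special_characters(denoised)
print(cleaned.split())
```

With a pattern like this, `remove_special_characters` already strips punctuation, so a separate `remove_punctuation` pass mostly acts as a safety net.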

In [None]:
def strip_html(text):
   
    return 

#Removing the square brackets
def remove_between_square_brackets(text):
    return 

#Removing the noisy text
def denoise_text(text):
  try:
   
    return 
  except:
    return
   

#Define function for removing special characters
def remove_special_characters(text, remove_digits=True):
    
    return 

#Define function for removing punctuation
def remove_punctuation(text):
    # Define a regex pattern to match punctuation
   
    # Use the sub() function to replace punctuation with an empty string

    return 

In [None]:
# Assuming 'df' is your pandas DataFrame containing the IMDb dataset,
# and 'review' is the name of the column containing the reviews
df['review']=df['review'].apply(denoise_text)
df['review']=df['review'].apply(remove_special_characters)
df['review']=df['review'].apply(remove_punctuation)

- Print the first few rows of `df` to view the cleaned data.

In [None]:
df.head()

#### T1.4: Count the number of stopwords present. Remove all words that are known stop words.  (Use nltk)  (weightage - 2 marks)(AE)        

#### Note:
- Import necessary modules from NLTK for text preprocessing and download NLTK stopwords dataset if not already downloaded.
- Define two functions: `num_stopwords(review)` and `remove_stopwords(review)`.
- Tokenize the input review using NLTK's `word_tokenize` function.
- In `num_stopwords(review)`, count the number of stopwords present in the tokenized review.
- Return the count of stopwords.
- In `remove_stopwords(review)`, remove stopwords from the tokenized review.
- Join the cleaned tokens back into a string.
- Return the cleaned review.
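
The counting and removal logic can be sketched as below. To keep the example runnable without downloading NLTK corpora, it uses a tiny hand-written stopword set and `str.split()`. Both are stand-ins (assumptions) for `stopwords.words('english')` and `word_tokenize`, and the function names are hypothetical.

```python
# Stand-in for nltk.corpus.stopwords.words("english"); deliberately tiny.
DEMO_STOPWORDS = {"the", "a", "is", "was", "and", "of", "it"}


def num_stopwords_demo(review, stop_words=DEMO_STOPWORDS):
    tokens = review.split()  # stand-in for nltk.word_tokenize
    return sum(1 for token in tokens if token in stop_words)


def remove_stopwords_demo(review, stop_words=DEMO_STOPWORDS):
    tokens = review.split()
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)


review = "the plot was thin and the acting was flat"
count = num_stopwords_demo(review)
cleaned = remove_stopwords_demo(review)
print(count, "|", cleaned)
```

Holding the stopwords in a `set` keeps each membership test constant-time, which matters when the function is applied to 50,000 reviews.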

In [None]:

# Download stopwords if not already downloaded


def num_stopwords(review):
    # Tokenize the review
    
    # Get English stopwords
   
    # Count the number of stopwords present
    return 

def remove_stopwords(review):

    # Tokenize the review
    
    # Get English stopwords
    
    # Remove stopwords
    
    # Join tokens back into a cleaned review
    
    return 

- Apply the `num_stopwords` method to each review in the 'review' column using the `apply` function.
- Sum the total number of stopwords across all reviews.

In [None]:
# Apply function to the specified column in the DataFrame


- Apply the `remove_stopwords` function to the 'review' column of DataFrame `df`.

In [None]:
# remove stopwords

- Remove the 'num_stopwords' column from the DataFrame `df`.

In [None]:
# Drop the num_stopwords column


#### __Task 1.5 to 2.6 may require use of GPU.__
#### Learners can request a GPU-based instance on demand by sending an email to `corpsupport@u-next.com`.

#### T1.5: Remove all words that have a length of 1 character. Count the number of such words removed. Perform lemmatization to reduce words to their base form. (weightage - 4 marks) (AE) & (ME)

#### Note:
- Define a function `count_short_words(review)` to count short words in a review.
- Tokenize the review into words.
- Count words with a length of 1.
- Return the count of short words.

In [None]:
def count_short_words(review):#AE
    # Split the review into tokens
    
    # Count words with length 1
    
    return 

- Define a function `count_short_words_apply` to count short words in a specified column of a DataFrame.
- Apply the `count_short_words` function to the specified column in the DataFrame and sum up the counts of short words.
- Return the total count of short words.

In [None]:
def count_short_words_apply(df, column_name):#AE
    # Apply count_short_words function to the specified column in the DataFrame
   
    return 

- Define a function called `remove_shortwords` to filter out short words from a given review.
- Tokenize the review into individual words using `word_tokenize`.
- Create a new list (`new`) containing only words with a length greater than 1. Join the filtered words back into a single string and return the filtered review string.
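
The counting and filtering steps of this task can be illustrated on a two-row toy DataFrame. The sketch assumes whitespace splitting as a stand-in for `word_tokenize`, and the `_demo` function names are hypothetical.

```python
import pandas as pd


def count_short_words_demo(review):
    # Count tokens that are exactly one character long.
    return sum(1 for word in review.split() if len(word) == 1)


def remove_short_words_demo(review):
    # Keep only tokens longer than one character.
    return " ".join(word for word in review.split() if len(word) > 1)


toy = pd.DataFrame({"review": ["a film i liked", "b movie at best"]})

# Count first, then remove, so the count reflects what is about to be dropped.
total_short = int(toy["review"].apply(count_short_words_demo).sum())
toy["review"] = toy["review"].apply(remove_short_words_demo)
print(total_short, toy["review"].tolist())
```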

In [None]:
def remove_shortwords(review):#ME
 
  return 

- Apply the function `count_short_words_apply` to the DataFrame `df` using the column 'review'.
- Print the total number of words with a length of 1 character.

In [None]:
total_short_words = count_short_words_apply(df, 'review')#AE
print("Total number of words with length = 1 character:", total_short_words)

- Apply the `remove_shortwords` function to the 'review' column in DataFrame `df`.

In [None]:
# Apply remove_shortwords


- Define a function `simple_lemmatizer` to perform lemmatization on text data.
- Tokenize the input text into individual words.
- Perform part-of-speech (POS) tagging on the words to determine their grammatical category (noun, verb, adjective, adverb).
- Lemmatize each word based on its POS tag, using WordNet lemmatization.
- Join the lemmatized words back into a string.
- Apply the `simple_lemmatizer` function to each review in the 'review' column of DataFrame `df`.
- Print the first few rows of the DataFrame using df.head().
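
One detail of the POS-based step is worth a sketch: `nltk.pos_tag` returns Penn Treebank tags (such as `VBD` or `JJ`), whereas `WordNetLemmatizer.lemmatize(word, pos=...)` expects a WordNet POS letter. The helper below, whose name is hypothetical, shows one common mapping. It avoids importing NLTK so it runs without any downloads, and the tagged pairs are hand-written to mimic tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective (the value of nltk.corpus.wordnet.ADJ)
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # noun, also used as the fallback for every other tag


# Pairs in the (word, tag) shape that nltk.pos_tag returns; written by hand here.
tagged = [("movies", "NNS"), ("were", "VBD"), ("really", "RB"), ("good", "JJ")]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

In the notebook, each `(word, letter)` pair would then be passed to the lemmatizer, and the results joined back into a string.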

In [None]:

# Download NLTK resources (if not already downloaded)


# Initialize the WordNet lemmatizer


# Function to perform lemmatization on text
def simple_lemmatizer(text):
    # Tokenize the text into words
   
    # Perform POS tagging
    
    # Lemmatize each word based on its POS tag
    
    # Join the lemmatized words back into a string
    return 

# Test the lemmatizer function with a sample text


In [None]:
df.head()

#### T1.6: Create wordcloud for positive and negative sentiments.             (weightage - 5 marks)            (ME)


#### Note:
- Define a WordCloud object named `WC` with the width of 1000 pixels, height of 500 pixels, maximum words of 500, and minimum font size of 5.
- Generate a WordCloud for positive reviews by joining all the review text where the sentiment is positive.
- Display the generated WordCloud using matplotlib.
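
A word cloud draws frequent words larger, so the input it needs is one long string per sentiment. The sketch below shows only the filtering, joining, and counting on toy data (an assumption for illustration). It does not import the `wordcloud` package, and in the notebook the joined string would be passed to the `WC` object instead.

```python
from collections import Counter

import pandas as pd

toy = pd.DataFrame(
    {
        "review": ["great film great cast", "dull film", "great story"],
        "sentiment": ["positive", "negative", "positive"],
    }
)

# Join every positive review into one string, as the word cloud step requires.
positive_text = " ".join(toy.loc[toy["sentiment"] == "positive", "review"])

# Word frequencies: the more frequent a word, the larger it is drawn.
top_words = Counter(positive_text.split()).most_common(2)
print(top_words)
```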

In [None]:
# word cloud for positive review words


- Define a method `create_wordcloud` to generate a word cloud for negative review words.
- Join all the negative review texts into a single string.
- Create a WordCloud object with specified parameters (width, height, max_words, min_font_size).
- Generate the word cloud using the negative review text.
- Display the generated word cloud using Matplotlib.

In [None]:
# word cloud for negative review words


# Task 2: Build a Neural Network Predictive Model with Randomized Search (weightage - 30 marks)      

#### T2.1: Load the cleaned dataset and divide it into predictor and target values (X & y) (weightage – 3 marks) (AE)

#### Note:
- Define a function `separate_data_and_target` to split a DataFrame into input features (X) and target variable (y).
- Inside the function, extract the 'review' column as input features (X) and the 'sentiment' column as the target variable (y).
- Return X and y as separate entities.

In [None]:
# Splitting into input features and output(target variable)
# Separate independent features and target variable
def separate_data_and_target(df):
    
    return 

- Assign the features to variable X and the target variable to variable y.
- Print the first few rows of X and y by using `X.head()` and `y.head()`.

In [None]:
X, y = separate_data_and_target(df)
print(X.head())
print(y.head())

#### T2.2 Handling categorical features: Use TF-IDF vectorizer with max_features 5000 to convert into numerical features. Convert target variable, sentiment positive to 1 and negative to 0 (weightage - 2 marks)    (AE and ME)

#### Note:
- Define a function named `label_encode` that converts 'positive' sentiment to 1 and other sentiments to 0.
- Use the `map` method to apply the `label_encode` function to the 'sentiment' column of DataFrame `df`.
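
A minimal sketch of the encoding step on toy data follows; the `_demo` name and the three-row DataFrame are assumptions for the example.

```python
import pandas as pd


def label_encode_demo(sentiment):
    # 'positive' becomes 1; anything else becomes 0.
    return 1 if sentiment == "positive" else 0


toy = pd.DataFrame({"sentiment": ["positive", "negative", "positive"]})
toy["sentiment"] = toy["sentiment"].map(label_encode_demo)
print(toy["sentiment"].tolist())
```

Note that `map` returns a new Series, so a `y` extracted from `df` before this step still holds the string labels. Re-extracting `y` afterwards is one way to get the numeric labels that the neural network models need.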

In [None]:
def label_encode(sentiment):

    return

In [None]:
df['sentiment'] = df['sentiment'].map(label_encode)

- Transform text data into TF-IDF vectors.
- Specify the maximum number of features to consider using the `max_features` parameter as 5000.
- Use the `fit_transform` method to transform the text data into TF-IDF vectors.
- Assign the transformed data to the variable `X_tfidf`.

In [None]:
#set max_features = 5000


- Convert a sparse matrix to a DataFrame.
- Use `pd.DataFrame()` to create a DataFrame from the sparse matrix `X_tfidf`.
- Optionally, add the 'review' column to the DataFrame using `df['review'].values`.
- Print the DataFrame.

In [None]:
# Convert the sparse matrix to a DataFrame

# Optional: Add the 'review' column to the DataFrame if needed

# Print the DataFrame


#### T2.3: Split the dataset into train and test in the ratio of 80:20. (weightage - 5 marks) (ME)

#### Note:
- Write a method using the `train_test_split` function from the `sklearn.model_selection` module to split data into training and testing sets.
- Use the feature matrix `X_tfidf.toarray()` and target vector `y`. 
- Set the test size to 20% and use a random state of 42 for reproducibility. The method should return `X_train`, `X_test`, `y_train`, and `y_test` arrays.
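
For illustration, the split described above behaves like this on a ten-sample toy array. The toy data is an assumption; in the notebook the inputs would be `X_tfidf.toarray()` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_demo = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```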

In [None]:
# split into training and testing


#### T2.4: Train a sequence classifier using a standard machine learning algorithm (Logistic Regression) to classify documents as either positive or negative. (weightage - 5 marks) (ME)

**Model versioning:**

- Save the model as ‘first_model’ to a version control system GitHub using git commands for collaboration, tracking changes, and ensuring transparency in model development.

#### Refer to the GitHub document from Lumen to create the repository and for the steps to commit
#### Add your GitHub repository link below

#### Note:
- Define a logistic regression model for classification (`lr_classifier`) using the sklearn library.
- Set regularization penalty as L2, maximum iterations as 500, regularization strength as 1, and random state as 42.
- Fit the logistic regression model (`lr_classifier`) using training data (`X_train` and `y_train`).
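
The hyperparameters listed above map onto scikit-learn's constructor as shown in this sketch, where `C` is the parameter that carries the regularization strength setting. The one-feature toy data and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two well-separated groups on a single feature.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lr_demo = LogisticRegression(penalty="l2", C=1.0, max_iter=500, random_state=42)
lr_demo.fit(X_demo, y_demo)

predictions = lr_demo.predict(np.array([[1.0], [12.0]]))
print(predictions)
```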

In [None]:
#training the model

#Fitting the model



#### T2.5: Train Multilayer Perceptron (MLP) models to classify documents as either positive or negative. (weightage - 10 marks) (ME)

**Model versioning**
- Save the model as ‘second_model’ to a version control system GitHub using git commands

#### Note:
- Import `Sequential` from `keras.models` and `Dense` from `keras.layers`.
- Define a Sequential model using `Sequential()`.
- Add a dense layer with 16 units and ReLU activation using `model_mlp.add(Dense(...))`.
- Add another dense layer with 8 units and ReLU activation.
- Add a dense layer with 1 unit and sigmoid activation.
- Compile the model with 'rmsprop' optimizer and binary crossentropy loss using `model_mlp.compile()`.
- Train the model using `fit()`, specifying input data (`X_train`) and labels (`y_train`), batch size (10), and number of epochs (15).
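
The architecture above can be sketched on random toy data as follows. Several details are assumptions made for the example: the explicit `Input` layer, the 20-feature toy input, the reduced epoch count, and `metrics=["accuracy"]`, which the note does not list but which `evaluate()` needs in order to report accuracy later.

```python
import numpy as np
from keras.layers import Dense, Input
from keras.models import Sequential

rng = np.random.default_rng(0)
X_demo = rng.random((40, 20)).astype("float32")  # 40 samples, 20 features
y_demo = rng.integers(0, 2, size=40).astype("float32")

model_demo = Sequential()
model_demo.add(Input(shape=(X_demo.shape[1],)))
model_demo.add(Dense(16, activation="relu"))
model_demo.add(Dense(8, activation="relu"))
model_demo.add(Dense(1, activation="sigmoid"))  # one probability per review

# metrics=["accuracy"] is an addition so that evaluate() also reports accuracy.
model_demo.compile(
    optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"]
)

# 2 epochs keep the demo fast; the task specifies 15.
history_demo = model_demo.fit(X_demo, y_demo, batch_size=10, epochs=2, verbose=0)
print(len(history_demo.history["loss"]))
```

Keeping the object returned by `fit()` is useful because the later evaluation task reads the epoch count from it.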

In [None]:
#training the model

#Fitting the model



#### T2.6: Train a CNN with Embedding layer to classify documents as either positive or negative.  (weightage-5 marks)(ME)

**Model versioning**
-  Save the model as ‘third_model’ to a version control system GitHub using git commands

#### Note:
- Create a CNN model for text classification.
- Import required modules from Keras and TensorFlow.
- Set the maximum number of words and sequence length.
- Use Tokenizer to tokenize and convert text to integers.
- Pad sequences to ensure they're all the same length.
- Split the data into training and testing sets.
- Define the CNN model's architecture:
    - Include an embedding layer for word embeddings.
    - Add a Conv1D layer with 128 filters and a kernel size of 5.
    - Include a GlobalMaxPooling1D layer to reduce dimensionality.
    - Add two Dense layers with 64 and 1 units respectively.
- Compile the model using the Adam optimizer and binary crossentropy loss.
- Train the model for 10 epochs with a batch size of 32 and validate the results.

In [None]:


# Tokenize and pad sequences

# Split the Data


# Build the CNN Model


# Compile the Model

# Train the Model


# Task 3: Evaluate the performance of the model using the right evaluation metrics. (weightage - 25 marks)

#### T3.1 Pull the models from GitHub using git commands and evaluate the model (weightage - 2 marks) (ME)

#### Model

In [None]:
#training the model

#Fitting the model


#### T3.2 Evaluate the Logistic Regression model with the evaluation metrics accuracy, precision, recall, and F1 score using the sklearn library. (weightage - 5 marks) (AE)

#### Note:

- __Function Definition:__ Define a function named evaluate_classification taking lr_classifier (the trained model), y_true (true labels), and X_test (test data).
- __Prediction:__ Predict labels using a logistic regression classifier (lr_classifier) on test data (X_test) and save the result as y_pred.
- __Evaluation Metrics:__
    * Calculate accuracy using `accuracy_score` function.
    * Calculate precision using `precision_score` function.
    * Calculate recall using `recall_score` function.
    * Calculate F1 score using `f1_score` function.
- __Storage:__ Store all metrics in a dictionary named metrics.
- __Return:__ Return the dictionary containing evaluation metrics.
#### Ranges
* Accuracy: 0.45 - 1 (2M)
* Precision: 0.45 - 1 (1M)
* Recall: 0.55 - 1 (1M)
* F1 Score: 0.5 - 1 (1M)
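
The four metric functions named above can be tried on hand-written labels, as in this sketch. The label lists are assumptions chosen so the arithmetic is easy to check: 3 true positives, 3 true negatives, 1 false positive, and 1 false negative.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / total
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of the two
}
print(metrics)
```

In the notebook, `y_pred` would come from `lr_classifier.predict(X_test)` rather than a hand-written list.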

In [None]:
#Predicting the model
def evaluate_classification(lr_classifier,y_true, X_test):
    accuracy,precision,recall,f1 = 0.0,0.0,0.0,0.0

    return accuracy,precision,recall,f1

- Call the function with appropriate test data.

In [None]:
# call evaluate_classification
evaluate_classification(lr_classifier,y_test, X_test)

#### T3.3 Using Lime/SHAP libraries, explain the prediction of your model and give inferences. (weightage-5 marks) (ME)

#### T3.4 For the trained MLP model used, specify the accuracy score, loss value, epochs and activation function used at the output layer of the model (weightage-8 marks)(AE)

#### Added model

In [None]:
from keras.models import Sequential
from keras.layers import Dense
model_mlp = Sequential()


#### Note:

Define a method named `evaluate_mlp` to assess the performance of a __Multi-layer Perceptron (MLP) model__.
Inside the method:

- Utilize the model's evaluate function to compute the loss and accuracy using the test data (X_test and y_test).
- Determine the number of epochs by calculating the length of the loss history.
- Extract the name of the output activation function used in the last layer of the model.
- Return the computed loss, accuracy and the name of the output activation function.

Remember to:
- Input the trained MLP model, test data (X_test and y_test), and training history.
- Use the method as follows: loss, accuracy, output_activation_function = evaluate_mlp(model, X_test, y_test, history).
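
The three lookups described above can be sketched with a deliberately tiny model. The helper name `evaluate_keras_demo`, the toy data, and the one-layer model are assumptions. The helper also returns the epoch count as a fourth value, which differs from the three-value signature in the task.

```python
import numpy as np
from keras.layers import Dense, Input
from keras.models import Sequential


def evaluate_keras_demo(model, X_test, y_test, history):
    # evaluate() returns [loss, accuracy] when compiled with metrics=["accuracy"].
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    epochs = len(history.history["loss"])  # one loss entry per epoch
    output_activation = model.layers[-1].activation.__name__
    return loss, accuracy, epochs, output_activation


rng = np.random.default_rng(0)
X_demo = rng.random((30, 4)).astype("float32")
y_demo = rng.integers(0, 2, size=30).astype("float32")

model = Sequential([Input(shape=(4,)), Dense(1, activation="sigmoid")])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X_demo, y_demo, epochs=3, batch_size=10, verbose=0)

loss, accuracy, epochs, activation = evaluate_keras_demo(model, X_demo, y_demo, history)
print(epochs, activation)
```

If the model is compiled without any metrics, `evaluate()` returns a single loss value and the two-name unpacking fails, so the compile step and this function need to agree.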

In [None]:
#Evaluate the Model
def evaluate_mlp(model,X_test,y_test,history):
    accuracy,loss ,output_activation_function = 0.0,0.0,None

    return accuracy,loss ,output_activation_function

- Use method evaluate_mlp to assess the MLP model's performance, then print accuracy, loss, epochs, and output activation function.

In [None]:
# Print loss, accuracy,output_activation_function 


#### T3.5 For the trained CNN with Embedding layer, specify the accuracy score, loss value, and epochs used. (weightage - 5 marks) (ME)

#### CNN Model

#### Note:
 
__Tokenize and pad sequences:__
- Define the maximum number of words as 10,000 and the maximum length of sequences as 100.
- Initialize a tokenizer.
- Teach the tokenizer about the data with **`fit_on_texts(X)`**.
- Convert the texts to sequences using **`texts_to_sequences(X)`**.
- Pad the sequences to ensure they're all the same length using **`pad_sequences()`**.
 
__Split the Data:__
- Divide the data into training and testing sets with 80% for training and 20% for testing.
- Utilize **`train_test_split()`** with the defined parameters.
 
__Build the CNN Model:__
- Set the embedding dimension as 50 and the vocabulary size as the maximum words.
- Create a Sequential model.
- Add an Embedding layer with the specified parameters.
- Include a 1D Convolutional layer with 128 filters, a kernel size of 5, and ReLU activation.
- Apply GlobalMaxPooling1D to reduce the dimensionality.
- Integrate two Dense layers with 64 and 1 neuron(s), respectively, using ReLU and sigmoid activations.
 
__Compile the Model:__
- Compile the model with the Adam optimizer and binary cross-entropy loss.
- Specify **'accuracy'** as the metric for evaluation.
 
__Train the Model:__
- Train the model on the training data for 10 epochs with a batch size of 32.
- Validate the model with 10% of the training data.
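
The build, compile, and train steps above can be sketched as follows. To stay self-contained, the example skips tokenization and feeds random integer sequences that stand in for already padded reviews. The toy data, the explicit `Input` layer, the `_demo` names, and the single training epoch are assumptions.

```python
import numpy as np
from keras.layers import Conv1D, Dense, Embedding, GlobalMaxPooling1D, Input
from keras.models import Sequential

max_words, max_len, embedding_dim = 10000, 100, 50

# Random word indices standing in for tokenized, padded reviews.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, max_words, size=(64, max_len))
y_demo = rng.integers(0, 2, size=64).astype("float32")

model_cnn_demo = Sequential(
    [
        Input(shape=(max_len,)),
        Embedding(input_dim=max_words, output_dim=embedding_dim),
        Conv1D(128, 5, activation="relu"),  # 128 filters, kernel size 5
        GlobalMaxPooling1D(),  # strongest response of each filter
        Dense(64, activation="relu"),
        Dense(1, activation="sigmoid"),
    ]
)
model_cnn_demo.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
)

# 1 epoch keeps the demo fast; the task specifies 10.
history_cnn_demo = model_cnn_demo.fit(
    X_demo, y_demo, epochs=1, batch_size=32, validation_split=0.1, verbose=0
)
print(sorted(history_cnn_demo.history))
```

`validation_split=0.1` holds out the last 10% of the training rows, which is how the validation step in the note can be expressed without a separate split.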

In [None]:


# Tokenize and pad sequences


# Split the Data


# Build the CNN Model


# Compile the Model


# Train the Model

#### Note:
Create a method named `evaluate_cnn` to assess the performance of a __Convolutional Neural Network (CNN) model.__
- Within the method:
    - Utilize the model's evaluate function to calculate the loss and accuracy using the test data (X_test and y_test).
    - Determine the number of epochs by extracting the length of the loss history from the training history.
    - Return the computed loss, accuracy, and number of epochs.

Remember to:
- Provide the trained CNN model, test data (X_test and y_test), and training history as input parameters.
- Utilize the method like this: loss, accuracy, epochs = evaluate_cnn(model, X_test, y_test, history).

In [None]:
#Evaluate the Model
def evaluate_cnn(model,X_test,y_test,history):
    loss, accuracy,epochs = 0.0,0.0,0.0
    
    return loss, accuracy,epochs

- Invoke the evaluate_cnn method with arguments (model_cnn, X_test, y_test, history_cnn).
- Print the accuracy score, loss value and the number of epochs.

In [None]:
# Print accuracy,loss,epochs


#### T3.6 Implement the unit test case and deploy a model using Flask / Streamlit. (weightage-10 marks)(ME)

#### Note:

- Import the necessary libraries: __keras.models__ for Sequential model and __keras.layers__ for Dense layers.
- Create a new Sequential model named __keras_model__.
- Add layers to the Sequential model and define their configurations (units, activation functions, input dimensions).
- Set the weights of the layers in the new model to be the same as the weights of an existing model (model).
- Save the Keras model to an HDF5 file named __'final_model.h5'__.

In [None]:
# Create a new Sequential model


# Add layers to the Sequential model and set the weights


# Set the weights of the layers in the new model


# Save the Keras model to an HDF5 file


- Import the necessary module json for working with JSON data.
- Convert the tokenizer's configuration to JSON format using the __to_json()__ method.
- Open a file named __'imdb_tokenizer.json'__ in write mode and encode it in UTF-8.
- Write the JSON data into the file using __json.dumps()__ function, ensuring non-ASCII characters are handled properly.

In [None]:
# Save tokenizer configuration to JSON file


# Task 4: Summarize the findings of the analysis and draw conclusions with PPT / PDF. (weightage - 15 marks)

**Final Submission guidelines:** 
1. Download the Jupyter notebook in HTML format.
2. Upload it to Lumen (UNext LMS).
3. Take a screenshot of T3.6 (Deployment) and upload it to Lumen (UNext LMS).
4. Upload the summarized PPT/PDF prepared in Task 4 to Lumen (UNext LMS).