# Project 8: Working with Textual Data: Text Classification

In the realm of entertainment, understanding audience sentiment towards movies is paramount for studios and production houses aiming to gauge the reception of their cinematic offerings. The IMDB movie review sentiment classification problem presents a crucial business challenge: accurately classifying movie reviews into positive or negative sentiments. What adds complexity to this task is the inherent variability in review lengths, the diverse vocabulary of words used, and the necessity for the model to discern the intricate long-term dependencies and contextual nuances embedded within the text.

Create a first-step document that lists the output of your exploratory analysis, any issues or problems with the data that need follow-up, and some basic descriptive analysis that highlights important findings from the data. The next level of analysis will be charted out based on these findings. Build a predictive model to classify the reviews as positive or negative. Perform a comparative study of several predictive models with various approaches and give your inferences accordingly.

**Dataset description:**  

There are 2 columns, `review` and `sentiment`, with 50,000 records. The `review` column contains IMDB reviews. `sentiment` is the target variable with 2 classes (positive and negative).

**Dataset: IMDB Dataset.csv**

**Software Engineering aspect:**  

Apply software engineering practices while building the model: use modular programming principles to organize your code into reusable functions or classes, which improves readability, maintainability, and collaboration.



**Initial Guidelines:**

1. Follow the User IDs provided by UNext as the file-naming convention.
2. Create a GitHub account and submit the GitHub link.
3. Tasks 1.5 to 2.6 may require the use of a GPU.
4. Learners can request a GPU-based instance on demand by sending an email to `corpsupport@u-next.com`

### General Instructions

- The cells in the Jupyter notebook can be executed any number of times for testing the solution
- Refrain from modifying the boilerplate code as it may lead to unexpected behavior 
- The solution is to be written between the comments `# code starts here` and `# code ends here`
- On completing all the questions, the assessment is to be submitted on Moodle for evaluation
- Before submitting the assessment, there should be `no error` while executing the notebook. If any code causes an error, please comment it out.
- The kernel of the Jupyter notebook is to be set as `Python 3 (ipykernel)` if not set already
- Include imports as necessary
- For each task, the `Note` section provides hints to solve the problem.
- Do not use a `print` statement inside the `except` block. Use only a `return` statement within the `except` block.

In [1]:
#Required imports
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag


# Task 1: Load the dataset and perform preliminary EDA with key observations and insights- (weightage - 20 marks)

#### T1.1: Load the IMDB Dataset using try and except blocks. (weightage - 2 marks) (AE)

#### Note:
- Define a function named `load_the_dataset()` that attempts to load data from a CSV file named "IMDB Dataset.csv".
- Use a `try-except` block inside the function.
- Try to read the CSV file using `pd.read_csv()`.
- If successful, return the dataset (`df`).
- If there's an error (e.g., file not found), return the message "File not found. Please check the file path."

In [2]:
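def load_the_dataset():
    try:
        # Read the CSV file into a DataFrame
        df=pd.read_csv("IMDB_Dataset.csv")
        return df
    except Exception:
        # Return (do not print) the message required by the note above
        return "File not found. Please check the file path."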

- After defining the function, call it to load the dataset and assign it to the variable `df`.
- Print the first few rows of the dataset using `print(df.head())`.

In [3]:
# store the result of the dataset
df=load_the_dataset()
print(df.head())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


#### T1.2: Check for distribution of target variable(percentage)(weightage - 2 marks)  (AE)             

#### Note:
- Define a function `target_class(df)` to analyze the distribution of a target variable ('sentiment') within a DataFrame (df).
- Inside the function, calculate the proportion of each unique value in the 'sentiment' column using the `value_counts` method with `normalize=True`.
- Multiply the proportions by 100 to obtain percentages.
- Return the calculated distribution.

In [4]:
def target_class(df):
    #code starts here
    distribution=df['sentiment'].value_counts(normalize=True)*100
    #code ends
    return distribution


- Call the function with `df` as argument to get `target_distribution`.
- Print `target_distribution`.

In [5]:
# store the result
target_distribution = target_class(df)
print(target_distribution)

positive    50.0
negative    50.0
Name: sentiment, dtype: float64


#### T1.3: Clean individual reviews: Remove all punctuations from words. Remove HTML tags. Remove words between square brackets. Remove all words that are not purely alphabetical characters. Convert all words to lowercase. Use error handling technique. (Weightage 5marks)(ME)

#### Note:
- Define a function `strip_html(text)` to remove HTML tags using BeautifulSoup.
- Define a function `remove_between_square_brackets(text)` to remove text within square brackets using regex.
- Define a function `denoise_text(text)` to apply `strip_html()` and `remove_between_square_brackets()` to lowercased text. Handle exceptions using `try-except`.
- Do not use `PRINT` statement inside the `Except` Block. Please use `return` statement only within the except block
- Define a function `remove_special_characters(text, remove_digits=True)` to remove special characters using regex.
- Define a function `remove_punctuation(text)` to remove punctuation using regex.
- Apply `denoise_text()`, `remove_special_characters()`, and `remove_punctuation()` functions to the 'review' column in DataFrame `df`.

In [6]:
def strip_html(text):
    soup=BeautifulSoup(text,"html.parser")
   
    return soup.get_text()

#Removing the square brackets
def remove_between_square_brackets(text):
    return re.sub(r'\[.*?\]','',text)

#Removing the noisy text
def denoise_text(text):
  try:
    text=strip_html(text)
    text=remove_between_square_brackets(text)
   
    return text.lower()
  except:
    return "wrong"
   

#Define function for removing special characters
def remove_special_characters(text, remove_digits=True):
    pattern=r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern,'',text)

#Define function for removing punctuation
def remove_punctuation(text):
    # Match any character that is neither a word character nor whitespace
    pattern=r'[^\w\s]'
   
    # Use the sub() function to replace punctuation with an empty string

    return re.sub(pattern,"",text)

In [7]:
# Assuming 'df' is your pandas DataFrame containing the IMDb dataset,
# and 'review_column' is the name of the column containing the reviews
df['review']=df['review'].apply(denoise_text)
df['review']=df['review'].apply(remove_special_characters)
df['review']=df['review'].apply(remove_punctuation)
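
The cleaning steps applied above can be traced on a single string. The sketch below is a regex-only stand-in, not the notebook's own functions: `clean_review` and the sample text are hypothetical, and HTML tags are stripped with a simple regex instead of BeautifulSoup.

```python
import re

def clean_review(text, remove_digits=True):
    """Regex-only stand-in for the cleaning steps (assumption: no BeautifulSoup)."""
    text = re.sub(r'<[^>]+>', '', text)   # drop HTML tags such as <br />
    text = re.sub(r'\[.*?\]', '', text)   # drop text inside square brackets
    text = text.lower()                   # normalise case
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)      # drop punctuation (and digits)

# Hypothetical sample review
sample = 'A <b>GREAT</b> film [spoiler ahead]!<br /><br />Rated 10/10, really.'
cleaned = clean_review(sample)
print(cleaned.split())  # prints ['a', 'great', 'film', 'rated', 'really']
```

Tags are removed without inserting a space, so words separated only by `<br />` fuse together (`end.<br />Start` becomes `endstart`). `BeautifulSoup.get_text()` with its default empty separator behaves the same way, which may be worth checking during EDA.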



- Print the first few rows of `df` to view the cleaned data.

In [8]:
df.head()

                                              review sentiment
0  one of the other reviewers has mentioned that ...  positive
1  a wonderful little production the filming tech...  positive
2  i thought this was a wonderful way to spend ti...  positive
3  basically theres a family where a little boy j...  negative
4  petter matteis love in the time of money is a ...  positive


#### T1.4: Count the number of stopwords present. Remove all words that are known stop words.  (Use nltk)  (weightage - 2 marks)(AE)        

#### Note:
- Import necessary modules from NLTK for text preprocessing and download NLTK stopwords dataset if not already downloaded.
- Define two functions: `num_stopwords(review)` and `remove_stopwords(review)`.
- Tokenize the input review using NLTK's `word_tokenize` function.
- In `num_stopwords(review)`, count the number of stopwords present in the tokenized review.
- Return the count of stopwords.
- In `remove_stopwords(review)`, remove stopwords from the tokenized review.
- Join the cleaned tokens back into a string.
- Return the cleaned review.

In [9]:

# Download stopwords if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')
def num_stopwords(review):
    # Tokenize the review
    tokens=word_tokenize(review)
    # Get English stopwords
    stop_words=set(stopwords.words('english'))
    # Count the number of stopwords present
    stopwords_count=sum(1 for word in tokens if word.lower() in stop_words)
    return stopwords_count

def remove_stopwords(review):

    # Tokenize the review
    tokens=word_tokenize(review)
    # Get English stopwords
    stop_words=set(stopwords.words('english'))
    # Remove stopwords
    cleaned_tokens=[word for word in tokens if word.lower() not in stop_words]
    # Join tokens back into a cleaned review
    cleaned_review=' '.join(cleaned_tokens)
    return cleaned_review

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/labuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/labuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
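
Both functions above tokenize the review and rebuild the stopword set on every call. A minimal sketch of doing the count and the removal in one pass, with the set built once, might look like this. `split_stopwords` and `STOP_WORDS` are hypothetical names, and a whitespace split and a tiny hand-written stop list stand in for `word_tokenize` and `stopwords.words('english')`.

```python
# Hypothetical miniature stop list, standing in for stopwords.words('english')
STOP_WORDS = frozenset({'the', 'a', 'is', 'of', 'and', 'this', 'was'})

def split_stopwords(review, stop_words=STOP_WORDS):
    """Return (stopword_count, cleaned_review) from a single tokenization."""
    tokens = review.split()  # whitespace split stands in for word_tokenize
    kept = [t for t in tokens if t.lower() not in stop_words]
    return len(tokens) - len(kept), ' '.join(kept)

count, cleaned = split_stopwords('this was a wonderful way of spending the evening')
print(count, cleaned)  # prints 5 wonderful way spending evening
```

Building the set once outside the function avoids recreating it for each of the 50,000 reviews.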


- Apply the `num_stopwords` method to each review in the 'review' column using the `apply` function.
- Sum the total number of stopwords across all reviews.

In [10]:
# Apply function to the specified column in the DataFrame
df['num_stopwords']=df['review'].apply(num_stopwords)


- Apply the `remove_stopwords` function to the 'review' column of DataFrame `df`.

In [11]:
# remove stopwords
df['review']=df['review'].apply(remove_stopwords)

- Remove the 'num_stopwords' column from the DataFrame `df`.

In [12]:
# Drop the num_stopwords column

df=df.drop(columns=['num_stopwords'])

In [13]:
df.head()

                                              review sentiment
0  one reviewers mentioned watching oz episode yo...  positive
1  wonderful little production filming technique ...  positive
2  thought wonderful way spend time hot summer we...  positive
3  basically theres family little boy jake thinks...  negative
4  petter matteis love time money visually stunni...  positive


#### __Task 1.5 to 2.6 may require use of GPU.__
#### Learners can request for GPU based instance on demand by sending an email to `corpsupport@u-next.com`.

## T1.5: Remove all words that have a length of 1 character. Count the number of such words removed. Perform lemmatization to reduce words to their base form. (weightage - 4 marks) (AE) & (ME)

#### Note:
- Define a function `count_short_words(review)` to count short words in a review.
- Tokenize the review into words.
- Count words with a length of 1.
- Return the count of short words.

In [14]:
def count_short_words(review):#AE
    # Split the review into tokens
    tokens=word_tokenize(review)
    
    # Count words with length 1
    short_words_count=sum(1 for word in tokens if len(word)==1)
    
    return short_words_count

- Define a function `count_short_words_apply` to count short words in a specified column of a DataFrame.
- Apply the `count_short_words` function to the specified column in the DataFrame and sum up the counts of short words.
- Return the total count of short words.

In [15]:
def count_short_words_apply(df, column_name):#AE
    # Apply count_short_words function to the specified column in the DataFrame
    total_short_words=df[column_name].apply(count_short_words).sum()
   
    return total_short_words

- Define a function called `remove_shortwords` to filter out short words from a given review.
- Tokenize the review into individual words using `word_tokenize`.
- Create a new list containing only words with a length greater than 1, join the filtered words back into a single string, and return the filtered review string.

In [16]:
def remove_shortwords(review):#ME
    tokens=word_tokenize(review)
    filtered_tokens=[word for word in tokens if len(word)>1]
    filtered_review=' '.join(filtered_tokens)
    return filtered_review

- Apply the function `count_short_words_apply` to the DataFrame `df` using the column 'review'.
- Print the total number of words with a length of 1 character.

In [17]:
total_short_words = count_short_words_apply(df, 'review')#AE
print("Total number of words with length = 1 character:", total_short_words)

Total number of words with length = 1 character: 6018


- Apply the `remove_shortwords` function to the 'review' column in DataFrame `df`.

In [18]:
# Apply remove_shortwords
df['review']=df['review'].apply(remove_shortwords)

- Define a function `simple_lemmatizer` to perform lemmatization on text data.
- Tokenize the input text into individual words.
- Perform part-of-speech (POS) tagging on the words to determine their grammatical category (noun, verb, adjective, adverb).
- Lemmatize each word based on its POS tag, using WordNet lemmatization.
- Join the lemmatized words back into a string.
- Apply the `simple_lemmatizer` function to each review in the 'review' column of DataFrame `df`.
- Print the first few rows of the DataFrame using df.head().

In [19]:

# Download NLTK resources (if not already downloaded)
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')


# Initialize the WordNet lemmatizer
lemmatizer=WordNetLemmatizer()


# Function to perform lemmatization on text
def simple_lemmatizer(text):
    # Tokenize the text into words
    tokens=word_tokenize(text)
   
    # Perform POS tagging
    tagged_tokens=pos_tag(tokens)
    
    # Lemmatize each word based on its POS tag
    lemmatized_words=[]
    for token,tag in tagged_tokens:
        if tag.startswith('J'):
            pos=wordnet.ADJ
        elif tag.startswith('V'):
            pos=wordnet.VERB
        elif tag.startswith('N'):
            pos=wordnet.NOUN
        elif tag.startswith('R'):
            pos=wordnet.ADV
        else:
            pos=wordnet.NOUN
        lemmatized_word=lemmatizer.lemmatize(token,pos)
        lemmatized_words.append(lemmatized_word)
    # Join the lemmatized words back into a string
    lemmatized_text=' '.join(lemmatized_words)
    return lemmatized_text

# Test the lemmatizer function with a sample text
print(simple_lemmatizer('playing played'))
df['review']=df['review'].apply(simple_lemmatizer)


[nltk_data] Downloading package wordnet to /home/labuser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/labuser/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/labuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


play play
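
The `if`/`elif` chain in `simple_lemmatizer` maps the first letter of a Penn Treebank tag to a WordNet part of speech. The same mapping can be sketched as a lookup table. `penn_to_wordnet` is a hypothetical helper, and the single-letter codes are the values that `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` hold in NLTK.

```python
# WordNet POS codes: 'a' adjective, 'v' verb, 'n' noun, 'r' adverb
_PENN_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def penn_to_wordnet(tag, default='n'):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS code, noun by default."""
    return _PENN_PREFIX_TO_WORDNET.get(tag[:1], default)

print([penn_to_wordnet(t) for t in ['VBG', 'JJ', 'NNS', 'RB', 'DT']])
# prints ['v', 'a', 'n', 'r', 'n']
```

The result can be passed straight to `lemmatizer.lemmatize(token, pos)` in place of the branch logic.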


In [20]:
df.head()

                                              review sentiment
0  one reviewer mention watch oz episode youll ho...  positive
1  wonderful little production film technique una...  positive
2  think wonderful way spend time hot summer week...  positive
3  basically there family little boy jake think t...  negative
4  petter matteis love time money visually stunni...  positive


#### T1.6: Create wordcloud for positive and negative sentiments.             (weightage - 5 marks)            (ME)


#### Note:
- Define a WordCloud object named `WC` with the width of 1000 pixels, height of 500 pixels, maximum words of 500, and minimum font size of 5.
- Generate a WordCloud for positive reviews by joining all the review text where the sentiment is positive.
- Display the generated WordCloud using matplotlib.

In [21]:
# word cloud for positive review words


- Define a method `create_wordcloud` to generate a word cloud for negative review words.
- Join all the negative review texts into a single string.
- Create a WordCloud object with specified parameters (width, height, max_words, min_font_size).
- Generate the word cloud using the negative review text.
- Display the generated word cloud using Matplotlib.

In [22]:
# word cloud for negative review words


# Task 2: Build a Neural Network Predictive Model with Randomized Search (weightage - 30 marks)      

#### T2.1: Load the cleaned dataset and divide it into predictor and target values (X & y) (weightage – 3 marks) (AE)

#### Note:
- Define a function `separate_data_and_target` to split a DataFrame into input features (X) and target variable (y).
- Inside the function, extract the 'review' column as input features (X) and the 'sentiment' column as the target variable (y).
- Return X and y as separate entities.

In [23]:
# Splitting into input features and output(target variable)
# Separate independent features and target variable
def separate_data_and_target(df):
    # Extract the 'review' column as input features (X)
    X = df['review']
    # Extract the 'sentiment' column as the target variable (y)
    y = df['sentiment']
    # Return X and y
    return X, y 

- Assign the features to variable X and the target variable to variable y.
- Print the first few rows of X and y using `X.head()` and `y.head()`.

In [24]:
X, y = separate_data_and_target(df)
print(X.head())
print(y.head())

0    one reviewer mention watch oz episode youll ho...
1    wonderful little production film technique una...
2    think wonderful way spend time hot summer week...
3    basically there family little boy jake think t...
4    petter matteis love time money visually stunni...
Name: review, dtype: object
0    positive
1    positive
2    positive
3    negative
4    positive
Name: sentiment, dtype: object


#### T2.2 Handling categorical features: Use TF-IDF vectorizer with max_features 5000 to convert into numerical features. Convert target variable, sentiment positive to 1 and negative to 0 (weightage - 2 marks)    (AE and ME)

#### Note:
- Define a function named `label_encode` that converts 'positive' sentiment to 1 and other sentiments to 0.
- Use the `map` method to apply the `label_encode` function to the 'sentiment' column of DataFrame `df`.

In [25]:
def label_encode(sentiment):
    if sentiment == 'positive':
        p=1
    else:
        p=0
    return p

In [26]:
df['sentiment'] = df['sentiment'].map(label_encode)
y = df['sentiment']

- Transform text data into TF-IDF vectors.
- Specify the maximum number of features to consider using the `max_features` parameter as 5000.
- Use the `fit_transform` method to transform the text data into TF-IDF vectors.
- Assign the transformed data to the variable `X_tfidf`.

In [27]:
#set max_features = 5000
from sklearn.feature_extraction.text import TfidfVectorizer
# Step 1: Set the maximum number of features for TF-IDF
max_features = 5000

# Step 2: Initialize the TfidfVectorizer with max_features
tfidf_vectorizer = TfidfVectorizer(max_features=max_features)

# Step 3: Transform the text data into TF-IDF vectors
X_tfidf = tfidf_vectorizer.fit_transform(df['review'])
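
To make the TF-IDF weights less of a black box, the sketch below recomputes them by hand with scikit-learn's default settings: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. `tfidf_rows` and the toy documents are hypothetical, and a whitespace split is assumed in place of the vectorizer's own tokenizer (which also lowercases text and drops single-character tokens).

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, idf, rows

vocab, idf, rows = tfidf_rows(['great film great cast', 'bad film', 'great fun'])
print(vocab)  # prints ['bad', 'cast', 'film', 'fun', 'great']
```

In the second document, "bad" appears in fewer documents than "film", so it receives the larger weight even though both occur once.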




- Convert a sparse matrix to a DataFrame.
- Use `pd.DataFrame()` to create a DataFrame from the sparse matrix `X_tfidf`.
- Optionally, add the 'review' column to the DataFrame using `df['review'].values`.
- Print the DataFrame.

In [28]:
# Convert the sparse matrix to a DataFrame
X_tfidf_df =pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Optional: Add the 'review' column to the DataFrame if needed
#X_tfidf_df['review'] = df['review'].values

# Print the DataFrame

print("TF-IDF DataFrame:")
print(X_tfidf_df.head())


TF-IDF DataFrame:
   aaron  abandon  abc  ability  able  aboard  abraham  abrupt  abruptly  \
0    0.0      0.0  0.0      0.0   0.0     0.0      0.0     0.0       0.0   
1    0.0      0.0  0.0      0.0   0.0     0.0      0.0     0.0       0.0   
2    0.0      0.0  0.0      0.0   0.0     0.0      0.0     0.0       0.0   
3    0.0      0.0  0.0      0.0   0.0     0.0      0.0     0.0       0.0   
4    0.0      0.0  0.0      0.0   0.0     0.0      0.0     0.0       0.0   

   absence  ...     youll     young  youngster     youre  youth  youve  zero  \
0      0.0  ...  0.061217  0.000000        0.0  0.000000    0.0    0.0   0.0   
1      0.0  ...  0.000000  0.000000        0.0  0.000000    0.0    0.0   0.0   
2      0.0  ...  0.000000  0.076121        0.0  0.000000    0.0    0.0   0.0   
3      0.0  ...  0.000000  0.000000        0.0  0.080674    0.0    0.0   0.0   
4      0.0  ...  0.000000  0.000000        0.0  0.000000    0.0    0.0   0.0   

     zombie  zone  zoom  
0  0.000000   0.0 

## T2.3: Split the dataset into train and test in the ratio of 80:20. (weightage – 5 marks) (ME)

#### Note:
- Write a method using the `train_test_split` function from the `sklearn.model_selection` module to split data into training and testing sets.
- Use the feature matrix `X_tfidf.toarray()` and target vector `y`. 
- Set the test size to 20% and use a random state of 42 for reproducibility. The method should return `X_train`, `X_test`, `y_train`, and `y_test` arrays.


In [35]:
# split into training and testing
from sklearn.model_selection import train_test_split
def split_train_test(X, y, test_size=0.2, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test

# Use the method to split the data
X_train, X_test, y_train, y_test = split_train_test(X_tfidf.toarray(), y)

# Print shapes to verify the splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (40000, 5000)
X_test shape: (10000, 5000)
y_train shape: (40000,)
y_test shape: (10000,)
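
Calling `X_tfidf.toarray()` turns the sparse matrix into a dense one, which is one reason a larger instance may be needed. A quick estimate of the memory involved, assuming 8-byte `float64` values (the `TfidfVectorizer` default); `dense_size_gb` is a hypothetical helper:

```python
def dense_size_gb(n_rows, n_cols, bytes_per_value=8):
    """Approximate size of a dense matrix in gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9

print(dense_size_gb(50_000, 5_000))  # full matrix: prints 2.0
print(dense_size_gb(40_000, 5_000))  # X_train: prints 1.6
```

scikit-learn's `train_test_split` and `LogisticRegression` also accept the sparse matrix directly, so the dense copy is only needed where a model requires it.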


#### T2.4: Train a sequence classifier using a standard machine learning algorithm (Logistic Regression) to classify documents as either positive or negative. (weightage - 5 marks) (ME)

**Model versioning:**

- Save the model as ‘first_model’ and push it to GitHub using git commands for collaboration, tracking changes, and ensuring transparency in model development.

#### Refer to the GitHub document from Lumen to create the repository and for the steps to commit
#### Add your GitHub repository link below

#### Note:
- Define a logistic regression model for classification (`lr_classifier`) using the sklearn library.
- Set the regularization penalty to L2, maximum iterations to 500, regularization strength `C` to 1, and random state to 42.
- Fit the logistic regression model (`lr_classifier`) using the training data (`X_train` and `y_train`).

In [38]:
from sklearn.linear_model import LogisticRegression
import pickle

#training the model
lr_classifier = LogisticRegression(penalty='l2', max_iter=500, C=1, random_state=42)

#Fitting the model
lr_classifier.fit(X_train, y_train)

# saving model
#with open('first_model.pkl', 'wb') as file:
    #pickle.dump(lr_classifier, file)




#### T2.5: Train Multilayer Perceptron (MLP) models to classify documents as either positive or negative. (weightage - 10 marks) (ME)

**Model versioning**
- Save the model as ‘second_model’ and push it to GitHub using git commands

#### Note:
- Import `Sequential` from `keras.models` and `Dense` from `keras.layers`.
- Define a Sequential model using `Sequential()`.
- Add a dense layer with 16 units and ReLU activation using `model_mlp.add(Dense(...))`.
- Add another dense layer with 8 units and ReLU activation.
- Add a dense layer with 1 unit and sigmoid activation.
- Compile the model with 'rmsprop' optimizer and binary crossentropy loss using `model_mlp.compile()`.
- Train the model using `fit()`, specifying input data (`X_train`) and labels (`y_train`), batch size (10), and number of epochs (15).

In [39]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

model_mlp = Sequential()

#training the model
# Add a dense layer with 16 units and ReLU activation
model_mlp.add(Dense(16, activation='relu', input_shape=(X_train.shape[1],)))

# Add a dense layer with 8 units and ReLU activation
model_mlp.add(Dense(8, activation='relu'))

# Add a dense layer with 1 unit and sigmoid activation
model_mlp.add(Dense(1, activation='sigmoid'))

# Compile the model with 'rmsprop' optimizer and binary crossentropy loss
model_mlp.compile(optimizer=RMSprop(),loss='binary_crossentropy', metrics=['accuracy'])


#Fitting the model
model_mlp.fit(X_train, y_train, epochs=15, batch_size=10)

# saving model (Keras models are saved with model.save() rather than pickle)
#model_mlp.save('second_model.h5')




2024-06-30 16:34:34.649701: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-30 16:34:34.652390: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-30 16:34:34.697965: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-30 16:34:34.698279: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f560cf5d490>

#### T2.6: Train a CNN with Embedding layer to classify documents as either positive or negative.  (weightage-5 marks)(ME)

**Model versioning**
- Save the model as ‘third_model’ and push it to GitHub using git commands

#### Note:
- Create a CNN model for text classification.
- Import required modules from Keras and TensorFlow.
- Set the maximum number of words and sequence length.
- Use Tokenizer to tokenize and convert text to integers.
- Pad sequences to ensure they're all the same length.
- Split the data into training and testing sets.
- Define the CNN model's architecture:
    - Include an embedding layer for word embeddings.
    - Add a Conv1D layer with 128 filters and a kernel size of 5.
    - Include a GlobalMaxPooling1D layer to reduce dimensionality.
    - Add two Dense layers with 64 and 1 units respectively.
- Compile the model using the Adam optimizer and binary crossentropy loss.
- Train the model for 10 epochs with a batch size of 32 and validate the results.

In [40]:
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding,Conv1D,GlobalMaxPooling1D,Dense
from keras.optimizers import Adam

# Set the maximum number of words and sequence length
max_words = 10000  # Maximum number of words to consider in the vocabulary
max_len = 100      # Maximum sequence length


# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df['review'])
sequences = tokenizer.texts_to_sequences(df['review'])
padded_sequences = pad_sequences(sequences, maxlen=max_len)

# Define target variable
y = df['sentiment']

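# Split the data into training and testing sets
# Note: this reuses the names X_train/X_test/y_train/y_test and replaces the TF-IDF split
# that the logistic regression and MLP models were trained on
X_train, X_test, y_train, y_test = train_test_split(padded_sequences,y,test_size=0.2,random_state=42)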

# Build the CNN model
model_cnn = Sequential()
# Add an embedding layer
model_cnn.add(Embedding(input_dim=max_words, output_dim=50, input_length=max_len))
# Add a Conv1D layer with 128 filters and kernel size of 5
model_cnn.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
# Add a GlobalMaxPooling1D layer
model_cnn.add(GlobalMaxPooling1D())
# Add a Dense layer with 64 units and ReLU activation
model_cnn.add(Dense(64, activation='relu'))
# Add a Dense layer with 1 unit and sigmoid activation for binary classification
model_cnn.add(Dense(1, activation='sigmoid'))

# Compile the model using the Adam optimizer and binary crossentropy loss
model_cnn.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model for 10 epochs with a batch size of 32 and validate the results
model_cnn.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)


# Save the trained CNN model as 'third_model.h5'
#model_cnn.save('third_model.h5')


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f55e4f2c460>
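
The layer stack trained above can be checked with a little arithmetic on the values set in the cell (`max_words = 10000`, `max_len = 100`, embedding size 50). The sketch below assumes Keras's default `'valid'` padding and stride 1 for `Conv1D`; `cnn_summary` and its dictionary keys are hypothetical.

```python
def cnn_summary(max_words=10_000, max_len=100, embed_dim=50,
                filters=128, kernel_size=5, dense_units=64):
    """Per-sample output shapes and trainable parameter counts of the CNN."""
    conv_steps = max_len - kernel_size + 1  # 'valid' padding, stride 1
    shapes = {
        'embedding': (max_len, embed_dim),
        'conv1d': (conv_steps, filters),
        'global_max_pool': (filters,),
        'dense_hidden': (dense_units,),
        'dense_output': (1,),
    }
    params = {
        'embedding': max_words * embed_dim,
        'conv1d': (kernel_size * embed_dim + 1) * filters,  # +1 for the bias
        'dense_hidden': (filters + 1) * dense_units,
        'dense_output': dense_units + 1,
    }
    return shapes, params

shapes, params = cnn_summary()
print(shapes['conv1d'], sum(params.values()))  # prints (96, 128) 540449
```

Most of the parameters (500,000 of them) sit in the embedding layer, and `model_cnn.summary()` should report the same totals.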

# Task 3: Evaluate the performance of the model using the right evaluation metrics. (weightage - 25 marks)

#### T3.1 Pull the models from GitHub using git commands and evaluate them (weightage - 2 marks) (ME)

#### Model

In [42]:
import pickle
from keras.models import load_model

# Load the saved logistic regression model
with open('first_model.pkl','rb') as f:
    lr_classifier=pickle.load(f)

# Keras models are restored with load_model() from a file written by model.save().
# Unpickling second_model.pkl raised:
# AttributeError: 'RMSprop' object has no attribute 'build'
model_mlp=load_model('second_model.h5')
#model_cnn=load_model('third_model.h5')

#### T3.2 Evaluate the Logistic Regression model with evaluation metrics accuracy and precision using sklearn library. (weightage-5 marks) (AE)

#### Note:

- __Function Definition:__ Define a function named evaluate_classification taking y_true (true labels) and X_test (test data).
- __Prediction:__ Predict labels using a logistic regression classifier (lr_classifier) on test data (X_test) and save the result as y_pred.
- __Evaluation Metrics:__
    * Calculate accuracy using `accuracy_score` function.
    * Calculate precision using `precision_score` function.
    * Calculate recall using `recall_score` function.
    * Calculate F1 score using `f1_score` function.
- __Storage:__ Store all metrics in a dictionary named metrics.
- __Return:__ Return the dictionary containing evaluation metrics.
#### Ranges
* Accuracy: 0.45 - 1 (2M)
* Precision: 0.45 - 1 (1M)
* Recall: 0.55 - 1 (1M)
* F1 Score: 0.5 - 1 (1M)
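
As a reference for what the four sklearn functions compute, the metrics can be derived by hand from the confusion counts, treating 1 as the positive class. `binary_metrics` and the label lists are hypothetical illustration data, not results from the notebook.

```python
def binary_metrics(true_labels, predicted):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': (tp + tn) / len(pairs), 'precision': precision,
            'recall': recall, 'f1': f1}

metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 1, 0, 0, 0])
print(metrics['accuracy'], metrics['recall'])  # prints 0.625 0.5
```

Here 2 of the 3 predicted positives are correct (precision 2/3) and 2 of the 4 actual positives are found (recall 0.5), giving F1 = 4/7.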

In [79]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
#Predicting the model
def evaluate_classification(lr_classifier,y_true, X_test):
    accuracy,precision,recall,f1 = 0.0,0.0,0.0,0.0
    # Predict labels for the test data
    y_pred=lr_classifier.predict(X_test)
    # Compute each metric against the true labels
    accuracy=accuracy_score(y_true,y_pred)
    precision=precision_score(y_true,y_pred)
    recall=recall_score(y_true,y_pred)
    f1=f1_score(y_true,y_pred)
    
    return accuracy,precision,recall,f1

- Call the function with appropriate test data.

In [80]:
# call evaluate_classification
evaluate_classification(lr_classifier,y_test, X_test)

NameError: name 'lr_classifier' is not defined

#### T3.3 Using Lime/SHAP libraries, explain the prediction of your model and give inferences. (weightage-5 marks) (ME)
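
Before reaching for LIME or SHAP, a linear model such as the logistic regression can be inspected directly. The sketch below is neither library: it ranks features by `coef * (x - baseline)`, the per-feature contribution to the log-odds relative to an average review, which is how SHAP attributions for a linear model are commonly described when features are treated as independent. `linear_contributions` and all numbers are hypothetical.

```python
def linear_contributions(coefs, x, baseline, feature_names, top_k=3):
    """Rank features by |coef * (x - baseline)|, largest first."""
    contribs = [(name, c * (xi - b))
                for name, c, xi, b in zip(feature_names, coefs, x, baseline)]
    return sorted(contribs, key=lambda kv: abs(kv[1]), reverse=True)[:top_k]

# Hypothetical coefficients, one review's TF-IDF values, and feature means
names = ['wonderful', 'boring', 'film', 'waste']
coefs = [2.0, -3.0, 0.1, -4.0]
x = [0.5, 0.0, 0.3, 0.2]
baseline = [0.1, 0.1, 0.3, 0.05]

top = linear_contributions(coefs, x, baseline, names)
print([name for name, _ in top])  # prints ['wonderful', 'waste', 'boring']
```

With the notebook's objects, the inputs would plausibly be `lr_classifier.coef_[0]`, one row of the TF-IDF test matrix, the column means of the TF-IDF training matrix, and `tfidf_vectorizer.get_feature_names_out()`. Positive contributions push the prediction towards the positive class.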

#### T3.4 For the trained MLP model used, specify the accuracy score, loss value, epochs and activation function used at the output layer of the model (weightage-8 marks)(AE)

#### Added model

In [None]:
from keras.models import Sequential
from keras.layers import Dense
model_mlp = Sequential()


#### Note:

Define a method named `evaluate_mlp` to assess the performance of a __Multi-layer Perceptron (MLP) model__.
Inside the method:

- Utilize the model's evaluate function to compute the loss and accuracy using the test data (X_test and y_test).
- Determine the number of epochs by calculating the length of the loss history.
- Extract the name of the output activation function used in the last layer of the model.
- Return the computed loss, accuracy and the name of the output activation function.

Remember to:
- Input the trained MLP model, test data (X_test and y_test), and training history.
- Use the method as follows: loss, accuracy, output_activation_function = evaluate_mlp(model, X_test, y_test, history).

In [None]:
#Evaluate the Model
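def evaluate_mlp(model, X_test, y_test, history):
    # evaluate() returns [loss, accuracy] when the model is compiled with metrics=['accuracy']
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    epochs = len(history.history['loss'])  # epochs run (not returned, per the note)
    # Name of the activation on the last (output) layer, e.g. 'sigmoid'
    output_activation_function = model.layers[-1].activation.__name__
    return loss, accuracy, output_activation_function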

- Use method evaluate_mlp to assess the MLP model's performance, then print accuracy, loss, epochs, and output activation function.

In [None]:
# Print loss, accuracy,output_activation_function 


#### T3.5 For the trained CNN with Embedding layer, specify the accuracy score, loss value, and number of epochs used. (weightage-5 marks) (ME)

#### CNN Model

#### Note:
 
__Tokenize and pad sequences:__
- Define the maximum number of words as 10,000 and the maximum length of sequences as 100.
- Initialize a tokenizer.
- Teach the tokenizer about the data with **`fit_on_texts(X)`**.
- Convert the texts to sequences using **`texts_to_sequences(X)`**.
- Pad the sequences to ensure they're all the same length using **`pad_sequences()`**.
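
To make the tokenize-and-pad steps concrete, here is a pure-Python sketch of what they compute. It is an illustration rather than the Keras implementation: the helper names `fit_vocab`, `to_sequences` and `pad` are hypothetical, the text is only lowercased and split on whitespace, and padding and truncation are applied at the front of each sequence, which is my understanding of the `pad_sequences()` defaults.

```python
from collections import Counter

MAX_WORDS = 10_000  # vocabulary cap from the note
MAX_LEN = 100       # sequence length from the note


def fit_vocab(texts):
    """Map each word to an integer index, most frequent word first (index 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left free so it can be used as the padding value
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, vocab, max_words=MAX_WORDS):
    """Replace words by their indices, dropping unknown or too-rare words."""
    return [
        [vocab[w] for w in text.lower().split() if w in vocab and vocab[w] < max_words]
        for text in texts
    ]


def pad(sequences, max_len=MAX_LEN):
    """Left-pad with zeros, or keep only the last max_len tokens."""
    padded = []
    for seq in sequences:
        seq = seq[-max_len:]
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded


reviews = ["good movie", "bad movie"]
vocab = fit_vocab(reviews)
sequences = to_sequences(reviews, vocab)
print(pad(sequences, max_len=4))
```

After this step every review is an integer vector of the same length, which is the input shape the Embedding layer expects.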
 
__Split the Data:__
- Divide the data into training and testing sets with 80% for training and 20% for testing.
- Utilize **`train_test_split()`** with the defined parameters.
 
__Build the CNN Model:__
- Set the embedding dimension as 50 and the vocabulary size as the maximum words.
- Create a Sequential model.
- Add an Embedding layer with the specified parameters.
- Include a 1D Convolutional layer with 128 filters, a kernel size of 5, and ReLU activation.
- Apply GlobalMaxPooling1D to reduce the dimensionality.
- Integrate two Dense layers with 64 and 1 neuron(s), respectively, using ReLU and sigmoid activations.
 
__Compile the Model:__
- Compile the model with the Adam optimizer and binary cross-entropy loss.
- Specify **'accuracy'** as the metric for evaluation.
 
__Train the Model:__
- Train the model on the training data for 10 epochs with a batch size of 32.
- Validate the model with 10% of the training data.
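
The build, compile and train steps above can be sketched as follows. The architecture follows the note (embedding dimension 50, 128 filters, kernel size 5, Dense layers of 64 and 1 units). The synthetic integer sequences `X_demo` / `y_demo` and the reduced `epochs=2` are assumptions that keep the example quick to run; the note itself asks for 10 epochs on the tokenized reviews.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

max_words = 10000   # vocabulary size
max_len = 100       # padded sequence length
embedding_dim = 50

model_cnn = Sequential([
    Input(shape=(max_len,)),
    Embedding(input_dim=max_words, output_dim=embedding_dim),
    Conv1D(filters=128, kernel_size=5, activation="relu"),
    GlobalMaxPooling1D(),            # collapses the time axis to one vector per review
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),  # probability of the positive class
])

model_cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Synthetic stand-in for padded review sequences (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.integers(1, max_words, size=(40, max_len))
y_demo = rng.integers(0, 2, size=(40,))

# validation_split=0.1 holds out 10% of the training data, as the note asks
history_cnn = model_cnn.fit(
    X_demo, y_demo, epochs=2, batch_size=32, validation_split=0.1, verbose=0
)
print(len(history_cnn.history["loss"]))
```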

In [None]:


# Tokenize and pad sequences


# Split the Data


# Build the CNN Model


# Compile the Model


# Train the Model

#### Note:
Create a method named `evaluate_cnn` to assess the performance of a __Convolutional Neural Network (CNN) model.__
- Within the method:
    - Utilize the model's evaluate function to calculate the loss and accuracy using the test data (X_test and y_test).
    - Determine the number of epochs by extracting the length of the loss history from the training history.
    - Return the computed loss, accuracy, and number of epochs.

Remember to:
- Provide the trained CNN model, test data (X_test and y_test), and training history as input parameters.
- Utilize the method like this: loss, accuracy, epochs = evaluate_cnn(model, X_test, y_test, history).

In [None]:
#Evaluate the Model
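def evaluate_cnn(model, X_test, y_test, history):
    # evaluate() returns [loss, accuracy] when the model is compiled with metrics=['accuracy']
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    # One loss entry is recorded per completed epoch
    epochs = len(history.history['loss'])
    return loss, accuracy, epochs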

- Invoke the evaluate_cnn method with arguments (model_cnn, X_test, y_test, history_cnn).
- Print the accuracy score, loss value and the number of epochs.

In [None]:
# Print accuracy,loss,epochs


#### T3.6 Implement the unit test case and deploy a model using Flask / Streamlit. (weightage-10 marks)(ME)

#### Note:

- Import the necessary libraries: __keras.models__ for Sequential model and __keras.layers__ for Dense layers.
- Create a new Sequential model named __keras_model__.
- Add layers to the Sequential model and define their configurations (units, activation functions, input dimensions).
- Set the weights of the layers in the new model to be the same as the weights of an existing model (model).
- Save the Keras model to an HDF5 file named __'final_model.h5'__.
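
A minimal sketch of these steps is given below, assuming the existing `model` is a small MLP. The `build_mlp` helper and its layer sizes are hypothetical stand-ins for the notebook's trained model, and the file is written to a temporary directory here rather than the working directory.

```python
import os
import tempfile

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Input


def build_mlp():
    # Hypothetical architecture; the new model must mirror the existing one layer by layer
    return Sequential([
        Input(shape=(20,)),
        Dense(16, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])


model = build_mlp()        # stands in for the already trained model
keras_model = build_mlp()  # new model with the same layer configuration

# Copy the weights layer by layer from the existing model
for new_layer, old_layer in zip(keras_model.layers, model.layers):
    new_layer.set_weights(old_layer.get_weights())

# The .h5 extension selects the HDF5 format
path = os.path.join(tempfile.mkdtemp(), "final_model.h5")
keras_model.save(path)
print(os.path.exists(path))
```

Copying weights only works when both models have layers with matching shapes, which is why the same builder is used for both here.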

In [None]:
# Create a new Sequential model


# Add layers to the Sequential model and set the weights


# Set the weights of the layers in the new model


# Save the Keras model to an HDF5 file


- Import the json module for working with JSON data.
- Convert the tokenizer's configuration to JSON format using the __to_json()__ method.
- Open a file named __'imdb_tokenizer.json'__ in write mode with UTF-8 encoding.
- Write the JSON data into the file using the __json.dumps()__ function with `ensure_ascii=False`, so that non-ASCII characters are written as-is rather than escaped.
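
The write step can be sketched without a fitted tokenizer. In the example below, `tokenizer_json` is a hypothetical string standing in for the result of `tokenizer.to_json()`, its contents are made up, and the file goes to a temporary directory. The read-back at the end is only there to show that the file round-trips.

```python
import json
import os
import tempfile

# Hypothetical stand-in for tokenizer.to_json(), which returns a JSON-formatted string
tokenizer_json = json.dumps(
    {"config": {"num_words": 10000, "word_index": {"café": 1, "movie": 2}}},
    ensure_ascii=False,
)

path = os.path.join(tempfile.mkdtemp(), "imdb_tokenizer.json")

# ensure_ascii=False keeps characters such as "é" readable in the file;
# encoding="utf-8" makes sure they are written correctly on every platform
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

# Reading it back: json.load() returns the original JSON string
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == tokenizer_json)
```

Because `to_json()` already returns a string, `json.dumps()` wraps it once more, so loading the file yields the JSON string again rather than a dictionary.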

In [None]:
# Save tokenizer configuration to JSON file


### Task 4: Summarize the findings of the analysis and draw conclusions with PPT / PDF. (weightage - 15 marks)

**Final Submission guidelines:** 
1. Download the Jupyter notebook in HTML format.
2. Upload it to the lumen (UNext LMS).
3. Take a screenshot of T3.6 (Deployment) and upload it to the lumen (UNext LMS).
4. Upload the summarized PPT/PDF prepared in Task 4 to the lumen (UNext LMS).