# 🧑‍🏫 Task 1 Part 2: Build Your Own Logistic Regression Model for Sentiment Analysis
In this exercise, you will build a **logistic regression model** from scratch to perform sentiment analysis.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like `LogisticRegression` from `sklearn`.

Follow the instructions step-by-step and answer the questions!

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/ml-trek-imdb/IMDB_Dataset.csv


## Step 1: Load the Data
**Task:** Use `pandas` to load the dataset from the file `IMDB_Dataset.csv`.

> **Hint:** Use `pd.read_csv()` to load the file and display the first 5 rows.

**Question:** What are the key features and the target variable in this dataset?

The key feature is the `review` text, and the target variable is the `sentiment` label (`positive` or `negative`).

In [2]:
data = pd.read_csv("/kaggle/input/ml-trek-imdb/IMDB_Dataset.csv")

In [3]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


## Step 3: Tokenization and Text Cleaning
**Task:** Implement your own function to:
1. Convert all text to lowercase.
2. Remove punctuation and special characters.
3. Split the text into words (tokenization).

> **Hint:** Use Python string methods and list comprehensions.

**Question:** Why is tokenization important for text-based models?

In [5]:
import re

def tokenize_clean(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = text.split()
    
    return tokens


In [7]:
tokenize_clean(data['review'][0])[0:10]


['one',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching']

Tokenization transforms raw text into individual units, typically words, that the model can process.
It gives the model a structured, consistent input, which makes it possible to analyze word frequency, context, and sentiment patterns.

## Step 4: Create a Vocabulary
**Task:** Create a **vocabulary** (a list of unique words) from the tokenized dataset.

> **Hint:** Use a set to store unique words, then convert it to a list.

**Question:** How does vocabulary size affect model performance?

In [6]:
vocabulary = set()
for review in data['review']:
    tokens = tokenize_clean(review)
    vocabulary.update(tokens)
    
vocabulary = sorted(list(vocabulary))

In [7]:
vocabulary[:25], len(vocabulary)

(['a',
  'aa',
  'aaa',
  'aaaaaaaaaaaahhhhhhhhhhhhhh',
  'aaaaaaaargh',
  'aaaaaaah',
  'aaaaaaahhhhhhggg',
  'aaaaagh',
  'aaaaah',
  'aaaaargh',
  'aaaaarrrrrrgggggghhhhhh',
  'aaaaatchkah',
  'aaaaaw',
  'aaaahhhhhh',
  'aaaahhhhhhh',
  'aaaand',
  'aaaarrgh',
  'aaaawwwwww',
  'aaaggghhhhhhh',
  'aaagh',
  'aaah',
  'aaahhhhhhh',
  'aaahthe',
  'aaall',
  'aaand'],
 175891)

In [26]:
limited_vocab_index = {word: i for i, word in enumerate(vocabulary)}

A larger vocabulary captures more detail and nuance in the language, which can improve accuracy because the model learns from more unique words.
However, a large vocabulary also increases computational requirements and may include many low-frequency or irrelevant words.
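
One common way to keep the vocabulary manageable is to retain only the most frequent words. The sketch below is not part of the original notebook: `build_limited_vocab` and `max_words` are hypothetical names, the cut-off value is arbitrary, and `collections.Counter` comes from the Python standard library rather than from `pandas` or `numpy`.

```python
from collections import Counter
import re

def tokenize_clean(text):
    # Same cleaning steps as in Step 3: lowercase, strip non-letters, split.
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text.split()

def build_limited_vocab(reviews, max_words=5000):
    """Map each of the max_words most frequent words to a column index.

    Hypothetical helper; max_words is an illustrative cut-off, not a tuned value.
    """
    counts = Counter()
    for review in reviews:
        counts.update(tokenize_clean(review))
    most_common = [word for word, _ in counts.most_common(max_words)]
    return {word: i for i, word in enumerate(most_common)}

# Tiny made-up corpus, only to show the shape of the result
sample_reviews = ["Good movie, good cast!", "Bad movie.", "Good plot"]
vocab_index = build_limited_vocab(sample_reviews, max_words=2)
```

For scale: a dense `float64` count matrix for 40,000 reviews needs roughly 56 GB with all 175,891 words, but about 1.6 GB with 5,000 words.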

## Step 5: Implement Word Count
**Task:** For every review, calculate and store the number of times each word appears in it.

In [12]:
# Count how often each word occurs across the whole corpus
# (a single tally over all reviews, not one tally per review).
word_count_dict = {}

for review in data['review']:
    tokens = tokenize_clean(review)
    for word in tokens:
        word_count_dict[word] = word_count_dict.get(word, 0) + 1



In [None]:
top_10_words = sorted(word_count_dict.items(), key=lambda item: item[1], reverse=True)[:10]
top_10_words
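
The task wording asks for counts per review, whereas the loop above builds one tally over the whole corpus. A per-review version could look like the sketch below. `per_review_counts` is a hypothetical name, and `collections.Counter` is from the standard library; this is one possible approach, not the notebook's own.

```python
from collections import Counter
import re

def tokenize_clean(text):
    # Same cleaning steps as in Step 3: lowercase, strip non-letters, split.
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text.split()

def per_review_counts(reviews):
    """Return one Counter per review, mapping word -> occurrences in that review."""
    return [Counter(tokenize_clean(review)) for review in reviews]

# Made-up reviews for illustration
sample_reviews = ["Great great film", "Not great"]
review_counts = per_review_counts(sample_reviews)
```

On the real data this would be called as `per_review_counts(data['review'])`, giving a list with one entry per row.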

## Step 6: Train-Test Split
**Task:** Split the data into **80% training** and **20% testing** sets.

> **Hint:** Use `numpy` or list slicing to split the data manually.

**Question:** Why do we need to split the data for training and testing?

In [23]:
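np.random.seed(42)  # fixed seed so the split is reproducible

# Shuffle row positions, then cut at the 80% mark
shuffled_indices = np.random.permutation(len(data))
split_index = int(0.8 * len(data))

train_indices = shuffled_indices[:split_index]
test_indices = shuffled_indices[split_index:]

# Materialize the two subsets here so the cells below can use them
train_data = data.iloc[train_indices]
test_data = data.iloc[test_indices]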

In [24]:
y_train = np.array([1 if label == 'positive' else 0 for label in train_data['sentiment']])

In [30]:
X_train = np.zeros((len(train_data), len(limited_vocab_index)))

In [38]:
for i, review in enumerate(train_data['review']):
    tokens = tokenize_clean(review)
    for word in tokens:
        if word in limited_vocab_index:
            X_train[i, limited_vocab_index[word]] += 1

In [39]:
X_train

array([[5., 0., 0., ..., 0., 0., 0.],
       [5., 0., 0., ..., 0., 0., 0.],
       [3., 0., 0., ..., 0., 0., 0.],
       ...,
       [8., 0., 0., ..., 0., 0., 0.],
       [3., 0., 0., ..., 0., 0., 0.],
       [6., 0., 0., ..., 0., 0., 0.]])

In [40]:
train_data = data.iloc[train_indices]
test_data = data.iloc[test_indices]

print("Training set size:", train_data.shape)
print("Testing set size:", test_data.shape)


Training set size: (40000, 2)
Testing set size: (10000, 2)


## Step 7: Building the Logistic Regression Model (Divided Steps)

### Part 1: The Prediction functions
The **prediction function** returns the predicted value of the data point using the weights and the bias. It uses the sigmoid function to convert the prediction into a value in the range of 0 to 1.

**Task:** Implement the sigmoid and prediction functions

In [41]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lr_prediction(weights, bias, features):
    z = np.dot(features, weights) + bias
    return sigmoid(z)
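
With raw word counts as features, `z` can become large in magnitude, and `np.exp(-x)` then overflows for very negative `x`, which NumPy reports as a `RuntimeWarning`. One simple guard, shown here as a sketch, is to clip the input before exponentiating. The name `stable_sigmoid` and the bound of 500 are assumptions chosen for illustration.

```python
import numpy as np

def stable_sigmoid(x):
    # Clip so that np.exp never sees an argument large enough to overflow float64.
    # The +/-500 bound is an illustrative choice; exp(500) is still finite.
    x = np.clip(x, -500, 500)
    return 1 / (1 + np.exp(-x))

# Extreme inputs that would overflow np.exp without the clip
extreme = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
```

The clipped version returns values that are indistinguishable from 0 and 1 at the extremes, so predictions are unaffected.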

### Part 2: Implementing the Error functions
**Task:** Implement the log loss for a single data point and the mean log loss over a whole dataset.

In [42]:
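def log_loss(weights, bias, features, label):
    prediction = lr_prediction(weights, bias, features)
    # Keep the prediction strictly inside (0, 1) so np.log never receives 0
    prediction = np.clip(prediction, 1e-15, 1 - 1e-15)
    return -(label * np.log(prediction) + (1 - label) * np.log(1 - prediction))

def total_log_loss(weights, bias, X, y):
    # Mean log loss over all rows of X
    losses = [log_loss(weights, bias, X[i], y[i]) for i in range(len(y))]
    return np.mean(losses)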

### Part 3: Update Weights
The **update_weights** function adjusts the weights and bias based on whether a point is correctly or incorrectly classified. It is a simple way of improving the model at every iteration:
1. **Correctly classified points:** Move the line **away** from the point.
2. **Incorrectly classified points:** Move the line **towards** the point.

**Task:** Implement the gradient update function based on these rules.

In [43]:
def update_weights(weights, bias, features, label, learning_rate=0.01):
    prediction = lr_prediction(weights, bias, features)
    error = prediction - label
    # Vectorized gradient step: each weight moves in proportion to its feature value
    weights -= learning_rate * error * features
    # The bias is updated once per data point, not once per weight
    bias -= learning_rate * error
    return weights, bias
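
For reference, the update rule can be derived from the log loss of a single example. The notation here is an assumption introduced for this note: $\hat{y}$ is the output of `lr_prediction`, $y$ is the label, $x_j$ is the $j$-th feature, and $\eta$ is `learning_rate`. The quantity `error` in the code corresponds to $\hat{y} - y$.

```latex
\begin{aligned}
\hat{y} &= \sigma(w \cdot x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
L &= -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr] \\
\frac{\partial L}{\partial w_j} &= (\hat{y} - y)\, x_j, \qquad
\frac{\partial L}{\partial b} = \hat{y} - y \\
w_j &\leftarrow w_j - \eta\, (\hat{y} - y)\, x_j, \qquad
b \leftarrow b - \eta\, (\hat{y} - y)
\end{aligned}
```

Each example therefore contributes one step to every weight and exactly one step to the bias.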

### Part 4: Implementing the Logistic Regression Algorithm
**Task:** Use the function to update weights to train the logistic regression model over multiple epochs. Keep track of the total error for each epoch. You will later plot these errors.

In [44]:
def lr_algorithm(features, labels, learning_rate=0.01, epochs=200):
    weights = np.zeros(features.shape[1])
    bias = 0.0
    error_history = []

    for epoch in range(epochs):
        # One pass over the data, updating after every single example
        for i in range(len(labels)):
            weights, bias = update_weights(weights, bias, features[i], labels[i], learning_rate)

        # Record the mean log loss at the end of each epoch for plotting later
        total_error = total_log_loss(weights, bias, features, labels)
        error_history.append(total_error)

    return weights, bias, error_history

In [None]:
weights, bias, error_history = lr_algorithm(X_train, y_train)




## Step 8: Evaluate Your Model
**Task:** Calculate the accuracy of the model. Compare the predicted labels with the actual labels.

> **Hint:** Use the formula for accuracy: (Correct Predictions / Total Predictions) * 100

**Question:** Which metric—accuracy, precision, or recall—is most important for sentiment analysis?

In [None]:
def calculate_accuracy(weights, bias, features, labels):
    predictions = [1 if lr_prediction(weights, bias, features[i]) >= 0.5 else 0 for i in range(len(labels))]
    correct_predictions = sum([1 for i in range(len(labels)) if predictions[i] == labels[i]])
    accuracy = (correct_predictions / len(labels)) * 100
    return accuracy



For sentiment analysis, precision is often the most important metric, especially when false positives (negative sentiment incorrectly classified as positive) are the more costly error.
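
To compare these metrics in practice, precision and recall can be computed with `numpy` alone. The following is a minimal sketch: `precision_recall` is a hypothetical helper, the labels are made up, and class 1 is assumed to mean `positive`, as in the `y_train` encoding above.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

In this toy case there are two true positives, one false positive, and one false negative, so both metrics come out to 2/3.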

## Step 8: Visualize the Errors  
**Task:** Create a scatter plot of the total errors over the training epochs. The plot should show a gradual decrease in errors, stabilizing as the model converges.

In [None]:
import matplotlib.pyplot as plt

def plot_errors(error_history):
    plt.figure(figsize=(10, 6))
    plt.plot(range(len(error_history)), error_history, marker='o', linestyle='-', color='b')
    plt.title("Total Log Loss Over Training Epochs")
    plt.xlabel("Epoch")
    plt.ylabel("Log Loss")
    plt.show()

plot_errors(error_history)


## Step 9: Make Predictions on New Data
**Task:** Use your trained model to predict the sentiment of the following review:

> _"The movie was absolutely fantastic and kept me hooked till the end."_

**Question:** What challenges might arise when predicting on new data?
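
A new review has to pass through the same tokenization and word-count encoding as the training data before the model can score it. The sketch below shows one way to do that. `vectorize_review` and `predict_sentiment` are hypothetical helpers, and the vocabulary, weights, and bias are toy values made up for illustration, not the trained model.

```python
import re
import numpy as np

def tokenize_clean(text):
    # Same cleaning steps as in Step 3: lowercase, strip non-letters, split.
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text.split()

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def vectorize_review(text, vocab_index):
    """Turn one review into a word-count vector; unknown words are skipped."""
    vec = np.zeros(len(vocab_index))
    for word in tokenize_clean(text):
        if word in vocab_index:
            vec[vocab_index[word]] += 1
    return vec

def predict_sentiment(text, weights, bias, vocab_index):
    """Return (label, probability of 'positive') for a single review."""
    features = vectorize_review(text, vocab_index)
    prob = float(sigmoid(np.dot(features, weights) + bias))
    return ("positive" if prob >= 0.5 else "negative"), prob

# Toy values for illustration only, NOT learned from the IMDB data
toy_vocab = {"fantastic": 0, "boring": 1}
toy_weights = np.array([2.0, -2.0])
toy_bias = 0.0

review = "The movie was absolutely fantastic and kept me hooked till the end."
label, prob = predict_sentiment(review, toy_weights, toy_bias, toy_vocab)
```

With the real model, the call would use the trained `weights` and `bias` together with `limited_vocab_index`. Words that never appeared in training contribute nothing to the score, which is one of the challenges with new data.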

## Step 10: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?

### Notes (if any):