# 🧑‍🏫 Task 1 Part 2: Build Your Own Logistic Regression Model for Sentiment Analysis
In this exercise, you will build a **logistic regression model** from scratch to perform sentiment analysis.

**Objective:** Implement all key components of an ML pipeline (except for data handling).

**Allowed Libraries:** `pandas`, `numpy`

**Not Allowed:** Any pre-built ML algorithms or functions like `LogisticRegression` from `sklearn`.

Follow the instructions step-by-step and answer the questions!

## Step 1: Load the Data
**Task:** Use `pandas` to load the dataset from a file named `IMDB_Dataset.csv`.

> **Hint:** Use `pd.read_csv()` to load the file and display the first 5 rows.

**Question:** What are the key features and the target variable in this dataset?

In [11]:
# Load the dataset and list its column names
import pandas as pd

f = pd.read_csv('IMDB_Dataset.csv')
list(f.columns)


['review', 'sentiment']

## Step 3: Tokenization and Text Cleaning
**Task:** Implement your own function to:
1. Convert all text to lowercase.
2. Remove punctuation and special characters.
3. Split the text into words (tokenization).

> **Hint:** Use Python string methods and list comprehensions.

**Question:** Why is tokenization important for text-based models?

In [3]:
# Write your own tokenizer function
def process(text):
    # Lowercase and split on whitespace; punctuation and HTML tags are not removed yet
    text = text.lower()
    return text.split()

process(f['review'][0])


['one',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 '1',
 'oz',
 'episode',
 "you'll",
 'be',
 'hooked.',
 'they',
 'are',
 'right,',
 'as',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'me.<br',
 '/><br',
 '/>the',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'oz',
 'was',
 'its',
 'brutality',
 'and',
 'unflinching',
 'scenes',
 'of',
 'violence,',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'go.',
 'trust',
 'me,',
 'this',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid.',
 'this',
 'show',
 'pulls',
 'no',
 'punches',
 'with',
 'regards',
 'to',
 'drugs,',
 'sex',
 'or',
 'violence.',
 'its',
 'is',
 'hardcore,',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'word.<br',
 '/><br',
 '/>it',
 'is',
 'called',
 'oz',
 'as',
 'that',
 'is',
 'the',
 'nickname',
 'given',
 'to',
 'the',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentar
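
The output above still contains punctuation and HTML line breaks (for example `'hooked.'` and `'me.<br'`), so item 2 of the task is not covered yet. A minimal sketch of one way to handle it, assuming that HTML tags such as `<br />` should be dropped and that only lowercase letters and digits are kept (so `you'll` becomes `you` and `ll`). The name `tokenize` is hypothetical and not part of the original notebook:

```python
import re

def tokenize(text):
    """Lowercase, strip HTML tags and punctuation, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags such as <br />
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation and special characters with spaces
    return text.split()

sample = "You'll be hooked.<br /><br />The first thing..."
tokens = tokenize(sample)
print(tokens)  # ['you', 'll', 'be', 'hooked', 'the', 'first', 'thing']
```

Whether to split contractions like this or keep the apostrophe is a design choice; either works as long as training and prediction use the same tokenizer.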

## Step 4: Create a Vocabulary
**Task:** Create a **vocabulary** (a list of unique words) from the tokenized dataset.

> **Hint:** Use a set to store unique words, then convert it to a list.

**Question:** How does vocabulary size affect model performance?
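
Following the hint, a small sketch of building the vocabulary from already-tokenized reviews. It assumes each review is a list of string tokens; `build_vocabulary` and `word_to_index` are hypothetical names introduced here for illustration:

```python
def build_vocabulary(tokenized_reviews):
    """Return a sorted list of unique words and a word -> index lookup."""
    unique_words = set()
    for tokens in tokenized_reviews:
        unique_words.update(tokens)
    vocabulary = sorted(unique_words)  # sorting makes the order reproducible
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    return vocabulary, word_to_index

# Toy data, made up for illustration
toy_reviews = [["great", "movie"], ["bad", "movie"]]
vocabulary, word_to_index = build_vocabulary(toy_reviews)
print(vocabulary)      # ['bad', 'great', 'movie']
print(word_to_index)   # {'bad': 0, 'great': 1, 'movie': 2}
```

Each vocabulary word becomes one feature, so the vocabulary size sets the length of every feature vector and the number of weights the model has to learn.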

## Step 5: Implement Word Count
**Task:** For every review in the dataset, calculate and store the number of times each word appears in that review.

In [4]:
# Your code here
# Example: Write functions to calculate word counts
def wordcount(text):
    # `text` is a list of tokens; returns a {word: count} dictionary
    ans = {}
    for word in text:
        ans[word] = ans.get(word, 0) + 1
    return ans
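
The model later takes a dot product between weights and features, so each review's word counts need to become a fixed-length numeric vector. A possible sketch, assuming a word-to-index lookup built from the vocabulary (the names `counts_to_vector` and `word_to_index` are hypothetical):

```python
import numpy as np

def counts_to_vector(word_counts, word_to_index):
    """Turn a {word: count} dict into a fixed-length count vector."""
    vector = np.zeros(len(word_to_index))
    for word, count in word_counts.items():
        index = word_to_index.get(word)
        if index is not None:  # words outside the vocabulary are skipped
            vector[index] = count
    return vector

# Toy lookup and counts, made up for illustration
word_to_index = {"bad": 0, "great": 1, "movie": 2}
vector = counts_to_vector({"great": 2, "movie": 1, "unseen": 1}, word_to_index)
print(vector.tolist())  # [0.0, 2.0, 1.0]
```

Stacking one such vector per review gives the feature matrix `X` used in the steps that follow.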

    

## Step 6: Train-Test Split
**Task:** Split the data into **80% training** and **20% testing** sets.

> **Hint:** Use `numpy` or list slicing to split the data manually.

**Question:** Why do we need to split the data for training and testing?

In [1]:
# Your code here
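
One way to do the split with `numpy`, sketched under the assumption that `X` is a 2-D feature array and `y` a 1-D label array with matching lengths. Shuffling before slicing is a choice made here (it guards against a file sorted by label); the function name and the `seed` parameter are hypothetical:

```python
import numpy as np

def train_test_split_manual(X, y, train_fraction=0.8, seed=0):
    """Shuffle the rows, then slice off the first 80% for training."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))      # random ordering of row indices
    cut = int(train_fraction * len(X))   # number of training rows
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data, made up for illustration
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2
X_train, X_test, y_train, y_test = train_test_split_manual(X, y)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```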



## Step 7: Building the Logistic Regression Model (Divided Steps)

### Part 1: The Prediction Functions
The **prediction function** returns the predicted probability for a data point. It computes the score $z = w \cdot x + b$ from the weights $w$, the features $x$, and the bias $b$, then applies the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ to map the score into the range 0 to 1.

**Task:** Implement the sigmoid and prediction functions.

In [1]:
import math

import numpy as np


def sigmoid(x):
    # Logistic function: 1 / (1 + e^(-x))
    return 1 / (1 + math.exp(-x))


def lr_prediction(weights, bias, features):
    # Linear score, squashed into the range (0, 1)
    a = np.dot(weights, features) + bias
    return sigmoid(a)

### Part 2: Implementing the Error Functions
**Task:** Implement the log loss for a single data point and the total log loss over the whole dataset.

In [2]:
def log_loss(weights, bias, features, label):
    func = lr_prediction(weights, bias, features)
    cost = -label * np.log(func) - (1 - label) * np.log(1 - func)
    return cost


def total_log_loss(weights, bias, X, y):
    total = 0
    m = X.shape[0]
    for i in range(m):
        total += log_loss(weights, bias, X[i], y[i])
    return total
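
If a prediction saturates to exactly 0 or 1, `np.log` returns `-inf` and the total loss stops being informative. A hedged sketch of one common workaround, clipping the probability before taking the log. The helper `safe_log_loss` and the `eps` value are assumptions made for this example, not part of the assignment:

```python
import numpy as np

def safe_log_loss(prediction, label, eps=1e-12):
    """Log loss on an already-computed probability, clipped away from 0 and 1."""
    p = np.clip(prediction, eps, 1 - eps)
    return -label * np.log(p) - (1 - label) * np.log(1 - p)

confident_right = safe_log_loss(0.9, 1)  # about 0.105: small loss
confident_wrong = safe_log_loss(0.9, 0)  # about 2.303: large loss
saturated = safe_log_loss(1.0, 0)        # large but finite, thanks to clipping
print(confident_right, confident_wrong, saturated)
```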

### Part 3: Update Weights
The **Update_Weights** function adjusts the weights and bias based on whether points are correctly or incorrectly classified. It is a simple method of improving the model at every iteration:
1. **Correctly classified points:** Move the line **away** from the point.
2. **Incorrectly classified points:** Move the line **towards** the point.

**Task:** Implement the gradient update function based on these rules.

In [None]:
# Your code
def lr_update_weights(weights, bias, features, label, learning_rate=0.01):

    return
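
One common way to realize these rules is a single stochastic-gradient step on the log loss: the weights move by `learning_rate * (label - prediction) * features`, so a badly misclassified point produces a large correction and a well-classified point a small one. The sketch below is one possible shape for the function, assuming it returns the updated weights and bias as a tuple (the stub above does not specify a return value):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lr_update_weights(weights, bias, features, label, learning_rate=0.01):
    """One gradient step of the log loss for a single data point."""
    prediction = sigmoid(np.dot(weights, features) + bias)
    error = label - prediction  # positive if the prediction is too low
    new_weights = weights + learning_rate * error * features
    new_bias = bias + learning_rate * error
    return new_weights, new_bias

# Toy example: zero weights give a prediction of 0.5, so the error is 0.5
weights, bias = np.zeros(2), 0.0
weights, bias = lr_update_weights(weights, bias, np.array([1.0, 2.0]),
                                  label=1, learning_rate=0.1)
# weights become [0.05, 0.1] and the bias becomes 0.05
print(weights, bias)
```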

### Part 4: Implementing the Logistic Regression Algorithm
**Task:** Use the weight-update function to train the logistic regression model over multiple epochs. Keep track of the total error for each epoch. You will later plot these errors.

In [None]:
# Implement the logistic regression model
def lr_algorithm(features, labels, learning_rate=0.01, epochs=200):
    return
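
A minimal sketch of what the training loop could look like, under several assumptions that the task does not fix: weights start at zero, every epoch visits each point once in shuffled order, and the function returns the weights, the bias, and the list of per-epoch total log losses. The `seed` parameter and the clipping constant are additions made for this example:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lr_algorithm(features, labels, learning_rate=0.01, epochs=200, seed=0):
    """Train with per-point gradient steps; record the total log loss per epoch."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=float)
    rng = np.random.default_rng(seed)
    weights = np.zeros(features.shape[1])
    bias = 0.0
    errors = []
    for _ in range(epochs):
        for i in rng.permutation(len(features)):
            prediction = sigmoid(np.dot(weights, features[i]) + bias)
            step = learning_rate * (labels[i] - prediction)
            weights = weights + step * features[i]
            bias = bias + step
        # Total log loss over the whole dataset after this epoch
        p = np.clip(sigmoid(features @ weights + bias), 1e-12, 1 - 1e-12)
        loss = -labels * np.log(p) - (1 - labels) * np.log(1 - p)
        errors.append(float(np.sum(loss)))
    return weights, bias, errors

# Toy, linearly separable data made up for illustration
X = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
y = np.array([1, 1, 0, 0])
weights, bias, errors = lr_algorithm(X, y, learning_rate=0.1, epochs=50)
print(errors[0], errors[-1])  # the last value should be lower than the first
```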

## Step 8: Evaluate Your Model
**Task:** Calculate the accuracy of the model. Compare the predicted labels with the actual labels.

> **Hint:** Use the formula for accuracy: (Correct Predictions / Total Predictions) * 100

**Question:** Which metric—accuracy, precision, or recall—is most important for sentiment analysis?

In [None]:
# Your code here
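
The accuracy formula from the hint translates directly into `numpy`. The sketch below also computes precision and recall so the three metrics from the question can be compared. It assumes labels are encoded as 1 for positive and 0 for negative, which is an encoding chosen for this example:

```python
import numpy as np

def accuracy(predicted_labels, actual_labels):
    """(Correct Predictions / Total Predictions) * 100."""
    predicted_labels = np.asarray(predicted_labels)
    actual_labels = np.asarray(actual_labels)
    return float(np.mean(predicted_labels == actual_labels) * 100)

def precision_recall(predicted_labels, actual_labels, positive=1):
    """Precision and recall for the class marked by `positive`."""
    predicted = np.asarray(predicted_labels) == positive
    actual = np.asarray(actual_labels) == positive
    true_pos = np.sum(predicted & actual)
    false_pos = np.sum(predicted & ~actual)
    false_neg = np.sum(~predicted & actual)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return float(precision), float(recall)

# Toy labels, made up for illustration
predicted = [1, 0, 1, 1]
actual = [1, 0, 0, 1]
acc = accuracy(predicted, actual)                       # 75.0
precision, recall = precision_recall(predicted, actual)  # 2/3 and 1.0
print(acc, precision, recall)
```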


## Step 8: Visualize the Errors  
**Task:** Create a scatter plot of the total errors over the training epochs. The plot should show a gradual decrease in errors, stabilizing as the model converges.

In [None]:
#Your code here
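
Plotting needs a library beyond `pandas` and `numpy`. The sketch below assumes `matplotlib` is permitted for this step and installed, which the allowed-libraries list above does not confirm. `epoch_numbers` and `plot_errors` are hypothetical helper names, and `errors` is assumed to be the per-epoch list of total errors collected during training:

```python
import numpy as np

def epoch_numbers(errors):
    """x-axis values: epochs numbered from 1."""
    return np.arange(1, len(errors) + 1)

def plot_errors(errors):
    """Scatter-plot the total error recorded at each epoch."""
    # Imported here so the rest of the notebook runs without matplotlib
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(epoch_numbers(errors), errors, s=10)
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Total log loss")
    ax.set_title("Training error per epoch")
    plt.show()

# Made-up error values, only to show the expected input shape
example_errors = [2.5, 1.8, 1.4, 1.2]
print(epoch_numbers(example_errors).tolist())  # [1, 2, 3, 4]
```

Calling `plot_errors(errors)` with the list recorded during training then draws the figure in the notebook.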

## Step 9: Make Predictions on New Data
**Task:** Use your trained model to predict the sentiment of the following review:

> _"The movie was absolutely fantastic and kept me hooked till the end."_

**Question:** What challenges might arise when predicting on new data?

In [None]:
# Your code here
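
A new review has to go through the same tokenization and vectorization as the training data before the model can score it. The sketch below shows that flow end to end. The three-word vocabulary and the weights are made up purely for illustration (a trained model would supply the real ones), and `predict_sentiment` is a hypothetical name:

```python
import re
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict_sentiment(review, weights, bias, word_to_index, threshold=0.5):
    """Tokenize, count vocabulary words, and threshold the sigmoid output."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", review.lower()).split()
    features = np.zeros(len(word_to_index))
    for token in tokens:
        index = word_to_index.get(token)
        if index is not None:  # unseen words carry no weight and are ignored
            features[index] += 1
    probability = float(sigmoid(np.dot(weights, features) + bias))
    label = "positive" if probability >= threshold else "negative"
    return label, probability

# Made-up vocabulary and weights, NOT the result of real training
word_to_index = {"boring": 0, "fantastic": 1, "hooked": 2}
weights = np.array([-1.5, 2.0, 1.0])
bias = 0.0

review = "The movie was absolutely fantastic and kept me hooked till the end."
label, probability = predict_sentiment(review, weights, bias, word_to_index)
print(label, round(probability, 4))  # positive 0.9526
```

Words that never appeared in training are silently dropped here, which is one of the challenges the question points at: a review made mostly of unseen words gives the model very little to work with.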


## Step 10: Wrap-up
1. How well did your model perform?
2. What challenges did you face while implementing it from scratch?
3. What improvements would you suggest for the future?

### Notes (if any):