<a href="https://colab.research.google.com/github/Shubham04689/colab_notebooks/blob/main/Text_representation_using_Sckit_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Objective

- To understand several text representation techniques

### Dataset
Here we use the Movies_review data, which contains 50000 reviews. The data is split evenly into training and testing sets: 25k reviews in reviews_train and 25k in reviews_test.
In each file, the first 12500 reviews are positive and the remaining 12500 are negative.



In [None]:
import requests
import tarfile
import os

def download_and_extract_dataset(url, extract_path='.'):
  """Downloads and extracts a dataset from a given URL.

  Args:
    url: The URL of the dataset.
    extract_path: The path to extract the dataset to.
  """

  # Download the dataset
  response = requests.get(url, stream=True)
  response.raise_for_status()

  # Save the dataset to a temporary file
  with open('temp_dataset.tar.gz', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
      f.write(chunk)

  # Extract the dataset
  with tarfile.open('temp_dataset.tar.gz') as tar:
    tar.extractall(extract_path)

  # Remove the temporary file
  os.remove('temp_dataset.tar.gz')

# Example usage:
dataset_url = 'https://cdn.talentsprint.com/aiml/movie_data.tar.gz'
download_and_extract_dataset(dataset_url)

### Extract data

### Importing required packages


In [None]:
import numpy as np
import pandas as pd
import os
import re
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Read each line into a list; the `with` block closes each file after reading
# .strip() returns a copy of the string with leading and trailing whitespace removed
with open("/content/movie_data/full_train.txt", "r") as f:
    reviews_train = [line.strip() for line in f]

with open("/content/movie_data/full_test.txt", "r") as f:
    reviews_test = [line.strip() for line in f]

In [None]:
# Read the 20000th review from train file
reviews_train[19999]

In [None]:
# Raw strings (r"...") keep the regex backslashes from being read as string escape sequences
Replace_without_space = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")    # All these characters in text will be removed
Replace_with_space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")            # All these characters in text will be replaced by space
NO_SPACE = ""
SPACE = " "

def preprocess_reviews(reviews):
    reviews = [Replace_without_space.sub(NO_SPACE, line.lower()) for line in reviews]
    reviews = [Replace_with_space.sub(SPACE, line) for line in reviews]
    return np.array(reviews)

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)

In [None]:
# Verify the 20000th review from train text file
reviews_train_clean[19999]
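
To see what the two patterns do, here is a minimal sketch that applies them step by step to a single made-up review. The sample string and the names `remove_pattern` and `space_pattern` are illustrative; the patterns themselves are the same as in the preprocessing cell, written as raw strings.

```python
import re

# Same patterns as in the preprocessing cell, written as raw strings
remove_pattern = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
space_pattern = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

# Made-up review text, for illustration only
sample = "Great movie!<br /><br />A well-made, 10/10 film."

# Step 1: lowercase the text, then delete punctuation and digits
step1 = remove_pattern.sub("", sample.lower())

# Step 2: replace double <br /> tags, hyphens and slashes with a space
step2 = space_pattern.sub(" ", step1)

print(repr(step1))
print(repr(step2))
print(step2.split())
```

Because digits are deleted before slashes are replaced, a rating such as "10/10" disappears completely and leaves only extra whitespace, which the vectorizers later ignore when tokenizing.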

Assign labels to the movie reviews: the first 12500 reviews are positive and the remaining 12500 are negative.

In [None]:
target = np.array([1 if i < 12500 else 0 for i in range(25000)])  # Labeling positive reviews as 1 and negative reviews as 0
print(target.shape, target[345], target[20000])

### CountVectorizer


Use n-grams to extract sequences of consecutive words from the text, then build the feature vectors for them with CountVectorizer.

In [None]:
"""To get binary values (1 for present or 0 for absent) instead of counts of terms/tokens, give binary=True.
ngram_range sets the lengths of the sequences of consecutive words to extract. For example, with ngram_range=(1, 3)
the vectorizer picks unigrams (single words), bigrams (2 consecutive words), and trigrams (3 consecutive words).
The cell below uses ngram_range=(1, 2), i.e. unigrams and bigrams."""

ngram_vectorizer = CountVectorizer(binary=False, ngram_range=(1, 2))
ngram_vectorizer.fit(reviews_train_clean)                         # Tokenize and build vocab
train_vec = ngram_vectorizer.transform(reviews_train_clean)       # To get feature vector for train data
test_vec = ngram_vectorizer.transform(reviews_test_clean)         # To get feature vector for test data
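
As a rough illustration of what counting n-grams means, the sketch below counts unigrams and bigrams in one made-up sentence using only the standard library. It is a simplification, not CountVectorizer itself: it splits on whitespace, whereas CountVectorizer's default token pattern also drops single-character tokens. The helper name `ngram_counts` is hypothetical.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2), binary=False):
    """Count word n-grams in whitespace-tokenized text (simplified illustration)."""
    tokens = text.split()
    min_n, max_n = ngram_range
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive words over the token list
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    if binary:
        # Record presence (1) instead of the number of occurrences
        counts = Counter({gram: 1 for gram in counts})
    return counts

counts = ngram_counts("good movie good cast")              # made-up text
binary_counts = ngram_counts("good movie good cast", binary=True)
print(counts)
print(binary_counts)
```

Each distinct n-gram becomes one column of the feature matrix, which is why adding bigrams makes the vocabulary much larger than with unigrams alone.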

#### Split the reviews_train data into train and test sets

Hint: Refer to [Train-Test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
# Split the train and test sets
X_train, X_test, y_train, y_test = train_test_split(train_vec, target, test_size=0.25, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
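
With 25000 reviews and test_size=0.25, the split should leave 18750 rows for training and 6250 for testing. The sketch below illustrates the idea of a seeded, shuffled split on row indices using only the standard library. The function name `shuffled_split` is hypothetical, and the rows it selects will not match the ones train_test_split selects, since scikit-learn uses its own random number generator.

```python
import math
import random

def shuffled_split(n_samples, test_size=0.25, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # same seed gives the same shuffle
    n_test = math.ceil(n_samples * test_size)  # number of held-out rows
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffled_split(25000)
print(len(train_idx), len(test_idx))
```

Shuffling matters here because the reviews are ordered by label: without it, the held-out quarter would contain only one class.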

#### Apply the Decision Tree Classifier to the split reviews_train data
Note: The code cell below takes some time to run

In [None]:
# Create an object for the DecisionTreeClassifier
decisiontree = DecisionTreeClassifier()

# Fit the model on the training split
decisiontree.fit(X_train, y_train)

# Predict labels for the held-out split
predict = decisiontree.predict(X_test)

# Calculate the accuracy
accuracy_score(y_test, predict)


In [None]:
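# Use the trained model to get the predictions on the reviews_test data.
# `target` applies here too: the test file has the same layout (first 12500 positive, remaining 12500 negative)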
predict = decisiontree.predict(test_vec)
accuracy_score(target, predict)
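
For reference, accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels. The helper name `accuracy` is illustrative and is not part of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 1, 0, 0]   # made-up labels
y_pred = [1, 0, 0, 0]   # made-up predictions: 3 of 4 are correct
print(accuracy(y_true, y_pred))
```

Since both files contain equal numbers of positive and negative reviews, a classifier that always predicts one class would score 0.5, which is a useful baseline when reading the accuracies above.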

### TF-IDF
tf-idf (term frequency-inverse document frequency) weights the number of times a word appears in a document (a movie review in our case) by how rare that word is across the corpus. Words that appear in many documents get weights closer to zero, and words that appear in fewer documents get weights closer to 1.

We have seen how to capture consecutive words using n-grams. Here we try without them: TfidfVectorizer's default ngram_range is (1, 1), so only single words are used.
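
The weighting can be worked through by hand on a tiny made-up corpus. The sketch below assumes the smoothed idf formula documented for TfidfVectorizer's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector. It uses plain whitespace tokenization rather than scikit-learn's tokenizer, and the corpus and the helper name `tfidf_weights` are illustrative.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good cast"]   # made-up corpus
tokenized = [doc.split() for doc in docs]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency (assumed default: smooth_idf=True)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in df}

def tfidf_weights(tokens):
    """Return L2-normalised tf-idf weights for one tokenized document."""
    tf = Counter(tokens)
    raw = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

for doc, tokens in zip(docs, tokenized):
    print(doc, {t: round(w, 3) for t, w in tfidf_weights(tokens).items()})
```

In the second document, "bad" appears in only one review while "movie" appears in two, so "bad" ends up with the larger weight even though both words occur once.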


In [None]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(reviews_train_clean)
X_train_tfidf = tfidf_vectorizer.transform(reviews_train_clean)
X_test_tfidf = tfidf_vectorizer.transform(reviews_test_clean)

#### Split the reviews_train data into train and test sets

Hint: Refer to [Train-Test split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
# Split the train and test sets
X1_train, X1_test, y1_train, y1_test = train_test_split(X_train_tfidf, target, test_size=0.25, random_state=42)


#### Apply the Decision Tree Classifier
Note: The code cell below takes some time to run

In [None]:
# Create an object of DecisionTreeClassifier
decisiontree = DecisionTreeClassifier()

# Fit the model on the training split
decisiontree.fit(X1_train, y1_train)

# Predict labels for the held-out split
predict = decisiontree.predict(X1_test)

# Calculate the accuracy
accuracy_score(y1_test, predict)


In [None]:
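# Use the trained model to get the predictions on the reviews_test data.
# `target` applies here too: the test file has the same layout (first 12500 positive, remaining 12500 negative)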
predict = decisiontree.predict(X_test_tfidf)
accuracy_score(target, predict)