In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/movie-reviews/movie_reviews.csv
/kaggle/input/usa-housing/USA_Housing.csv


### **Q1: Based on Loading corpus**
* Load the dataset into a Pandas DataFrame
* Load the first 100 reviews from the review column of the DataFrame into a list of strings called ‘corpus’

In [2]:
df=pd.read_csv('../input/movie-reviews/movie_reviews.csv')

In [3]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [4]:
corpus=df['review'].head(100).tolist()  # first 100 reviews as a list of strings

In [5]:
df.review[99]

"I have been a Mario fan for as long as I can remember, I have very fond memories of playing Super Mario World as a kid, this game has brought back many of those memories while adding something new. Super Mario Galaxy is the latest installment in the amazing Mario franchise. There is much very different about this game from any other Mario before it, while still keeping intact the greatest elements of Mario, the first noticeable difference is that the story takes place in space.<br /><br />The story begins much like any other Mario game, Mario receives a letter from Princess Peach inviting him to a celebration at her castle in the Mushroom Kingdom. Upon arriving at Peach's castle Mario finds Bowser and his son (Bowser Jr.) attacking the castle with their airships. Bowser kidnaps Princess Peach and then lifts her castle up into space. In the midst of the castle being lifted into space Mario falls off and lands on an unknown planet. Mario is found by a talking star named Luma and is take

In [6]:
corpus

["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the f

### **Q2: Based on Pre-processing Corpus**
#### Pre-process each document (i.e. each string in the corpus list) so that all **words are in lower case** and there
#### are no special symbols, URLs, numbers, or stopwords. Also, stem each word of the document to
#### its root form using the **Porter Stemmer**

# Pre-processing: Step 1: Normalization

Normalization of text includes the following steps:
1. Converting the text to a single case (lower, upper, or proper case)
2. Removing numbers, special symbols, and URLs from the text.


In [7]:
lower=[]
for strings in corpus:
    lower.append(' '.join([word.lower() for word in strings.split()]))
    

In [8]:
lower

["one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.<br /><br />the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.<br /><br />it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />i would say the main appeal of the show is due to the f

In [9]:
alpha=[]
for strings in lower:
    alpha.append(' '.join([word for word in strings.split() if word.isalpha()]))
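Note that filtering with `word.isalpha()` drops every token that has punctuation attached, so a word such as "hooked." disappears entirely instead of being cleaned. As a hedged alternative, the sketch below strips HTML tags, URLs, digits, and symbols with regular expressions and keeps the remaining words. The `normalize` helper and the sample string are made up for illustration and are not part of the notebook's pipeline.

```python
import re

def normalize(text):
    """Lower-case the text and strip HTML tags, URLs, digits, and symbols."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags such as <br />
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # URLs (before symbols are stripped)
    text = re.sub(r"[^a-z\s]", " ", text)              # digits and special symbols
    return " ".join(text.split())                      # collapse repeated whitespace

# Hypothetical sample, loosely modelled on the second review
sample = ("A wonderful little production. <br /><br />The filming technique "
          "is 100% unassuming; see http://example.com")
print(normalize(sample))
```

One consequence of this choice is that apostrophes become spaces, so "you'll" turns into "you ll"; whether that is acceptable depends on the later stop-word and stemming steps.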

In [10]:
# alpha[0].split()

# Pre-processing Step 2: Tokenization

Tokenization converts each document into a list of words. It can be done in two ways:
1. the .split() method of str
2. the word_tokenize function of nltk.tokenize
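The two approaches differ mainly on text that still contains punctuation: `.split()` only breaks on whitespace, while `word_tokenize` separates punctuation into its own tokens. The sketch below illustrates that difference on a sentence from the first review, using a simple regular expression as a rough stand-in; it is not NLTK's actual tokenization algorithm.

```python
import re

text = "they are right, as this is exactly what happened with me."

# Whitespace splitting keeps punctuation glued to the neighbouring word
whitespace_tokens = text.split()

# Rough stand-in for a punctuation-aware tokenizer:
# runs of letters become words, any other non-space character is its own token
regex_tokens = re.findall(r"[a-z]+|[^\sa-z]", text)

print(whitespace_tokens[:3])
print(regex_tokens[:4])
```

On text that has already been reduced to purely alphabetic words, as in this notebook, both approaches give the same tokens.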

In [11]:
tokenize=[]
for strings in alpha:
    tokenize.append(strings.split())
tokenize[1][:4]   

['a', 'wonderful', 'little', 'filming']

In [12]:
from nltk.tokenize import word_tokenize
tokenize=[]
for strings in alpha:
    tokenize.append(word_tokenize(strings))

# Pre-processing Step 3: Stop-word Removal
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that carries little meaning on its own and is therefore usually removed in NLP applications.

NLTK (Natural Language Toolkit) in Python provides lists of stop words in its stopwords corpus for 16 different languages (the exact count depends on the version of the NLTK data).

Each list is identified by the name of its language, e.g. stopwords.words('english').
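As a minimal sketch of the filtering step, the example below uses a small hand-written set in place of the full NLTK list. `STOP_WORDS` and `remove_stop_words` are hypothetical names introduced here; the set form is chosen because membership tests on a set do not scan the whole collection the way a list lookup does.

```python
# Hypothetical subset of English stop words, for illustration only
STOP_WORDS = {"i", "me", "my", "myself", "we", "a", "an", "the", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set, preserving order."""
    return [token for token in tokens if token not in stop_words]

print(remove_stop_words(["a", "wonderful", "little", "filming"]))
```

With the real NLTK list, the same idea would apply after converting the list to a set once.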

In [13]:
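import nltk
from nltk.corpus import stopwords  # explicit import instead of a wildcard
# Note: this rebinds the name `stopwords` from the corpus reader to a plain list of words
stopwords=stopwords.words('english')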

In [14]:
stopwords[:5]

['i', 'me', 'my', 'myself', 'we']

In [15]:
no_stop=[]
for list1 in tokenize:
    no_stop.append([word for word in list1 if word not in stopwords])
no_stop[1][:5]    

['wonderful', 'little', 'filming', 'technique', 'fashion']

In [16]:
from nltk.stem import PorterStemmer

ps=PorterStemmer()
final=[]
# Stem every token and join each document back into a single string
for tokens in no_stop:
    final.append(' '.join([ps.stem(word) for word in tokens]))

In [17]:
final[1]

'wonder littl film techniqu fashion give sometim sens realism entir actor extrem well michael sheen got voic pat truli see seamless edit guid refer diari well worth watch terrificli written perform master product one great comedi realism realli come home littl fantasi guard rather use tradit techniqu remain solid play knowledg particularli scene concern orton halliwel set flat mural decor everi terribl well'

### **Q3: Based on Constructing Term Document Matrix**
#### **For the preprocessed corpus, construct a Term Document Matrix (using both built-in methods and from scratch), whose entries are:**
* Binary
* Actual Term Frequency
* Term Frequency with length normalization
* Term Frequency-Inverse Document Frequency
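A from-scratch construction of the four variants can be sketched as follows. This is a minimal illustration under stated assumptions: `docs` is a toy corpus standing in for the stemmed `final` list, rows are terms and columns are documents, length normalization divides by the number of tokens in the document, and IDF is taken as log(N / df) without smoothing. Built-in implementations such as scikit-learn's `TfidfVectorizer` use a smoothed IDF and L2 normalization by default, so their values would not match these exactly.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the stemmed `final` list
docs = ["wonder littl film", "film film bad plot", "wonder plot"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})
counts = [Counter(tokens) for tokens in tokenized]
n_docs = len(docs)

# Actual term frequency: one row per term, one column per document
tf = [[c[term] for c in counts] for term in vocab]

# Binary: 1 if the term occurs in the document at all
binary = [[1 if v > 0 else 0 for v in row] for row in tf]

# Term frequency normalized by document length (assumed: token count)
doc_len = [len(tokens) for tokens in tokenized]
tf_norm = [[v / doc_len[j] if doc_len[j] else 0.0 for j, v in enumerate(row)]
           for row in tf]

# TF-IDF with the assumed unsmoothed IDF: log(N / document frequency)
doc_freq = [sum(row) for row in binary]
idf = [math.log(n_docs / d) for d in doc_freq]
tfidf = [[v * idf[i] for v in row] for i, row in enumerate(tf)]

for term, row in zip(vocab, tf):
    print(term, row)
```

The same structure should carry over to the real corpus by replacing `docs` with `final`; for 100 documents the nested lists remain small enough that a dense representation is not a concern.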