<a href="https://colab.research.google.com/github/DurgalakshmiU/durgaa_INFO5731_Spring2021/blob/main/In_class_exercise_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The fifth In-class-exercise (2/23/2021, 20 points in total)

In exercise-03, I asked you to collect 500 texts based on your own information needs (if you didn't collect them, you should do so now for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features (10 feature types, which will yield well over 10 individual features) from the following feature list, then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjective, adverb, auxiliary, punctuation, complementizer, coordinating conjunction, subordinating conjunction, determiner, interjection, noun, possessor, preposition, pronoun, quantifier, verb, and other. (select some of them if you use pos-tag features)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density (calculated as the ratio of the following function words to content words) of determiners/quantifiers
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features you have in mind (e.g., pre-defined patterns)
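
The density and index features in the list above reduce to simple ratios and differences once the counts are available. The following is a minimal sketch, not part of the assignment: the helper names `density` and `difference_index`, the choice of content-word tags, and the example counts are all assumptions made for illustration.

```python
from collections import Counter

# Universal POS tags treated as content words in this sketch (an assumption).
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def density(tags, function_tags):
    """Ratio of the selected function-word tags to content-word tags."""
    counts = Counter(tags)
    function_count = sum(counts[t] for t in function_tags)
    content_count = sum(counts[t] for t in CONTENT_TAGS)
    # Guard against texts that contain no content word at all.
    return function_count / content_count if content_count else 0.0


def difference_index(first, second):
    """Generic 'A minus B' index, e.g. right- minus left-branching nodes."""
    return first - second


# Hypothetical tag sequence for "the silence of the lambs".
tags = ["DET", "NOUN", "ADP", "DET", "NOUN"]
determiner_density = density(tags, {"DET"})    # 2 determiners / 2 content words
preposition_density = density(tags, {"ADP"})   # 1 preposition / 2 content words
modification_index = difference_index(2, 1)    # made-up counts: 2 premodifiers, 1 postmodifier
```

The same `difference_index` helper could serve the branching index and the weight indices, provided the node or token counts come from a constituency or dependency parser.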

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url='https://www.imdb.com/list/ls050782187/?sort=list_order,asc&st_dt=&mode=detail&page='
l1=[]

# Scrape the title header of every entry on list pages 1-5 to collect the 500 texts
for page in range(1, 6):
    req = requests.get(url + str(page))
    soup = bs(req.content, 'html.parser')
    result = soup.find_all('h3', {'class': 'lister-item-header'})
    for i in result:
        l1.append(i.text)

# Save the collected texts to a CSV file
df = pd.DataFrame()
df['Movie name'] = l1
df = df.replace('\n', '', regex=True)
df.to_csv(r'/content/movietop.csv', index=True)
df

Unnamed: 0,Movie name
0,1.The Godfather(1972)
1,2.The Silence of the Lambs(1991)
2,3.Star Wars: Episode V - The Empire Strikes Ba...
3,4.The Shawshank Redemption(1994)
4,5.The Shining(1980)
...,...
495,"496.Me, Myself & Irene(2000)"
496,497.The Darjeeling Limited(2007)
497,498.Fear(1996)
498,499.Planet Terror(2007)


In [None]:
#cleaning
df = pd.read_csv('movietop.csv')
df.to_csv("movietop.csv", index=False)
df['cleansentence'] = df['Movie name'].apply(lambda a: " ".join(a.lower() for a in a.split()))
df['cleansentence'].head()  #lower casing
df['cleansentence'] = df['cleansentence'].str.replace(r'[^\w\s]', '', regex=True)
df['cleansentence'] .head()  # special char and punctuation removal
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
df['cleansentence']= df['cleansentence'].apply(lambda a: " ".join(a for a in a.split() if a not in stop))
df['cleansentence'].head()  #stopwords removal
df['cleansentence'] = df['cleansentence'].str.replace(r"[0-9]", " ", regex=True)
df['cleansentence'].head()      #Numbers removal
from nltk.stem import PorterStemmer
st = PorterStemmer()    #stemming
df['cleansentence'][:6].apply(lambda a: " ".join([st.stem(word) for word in a.split()]))
nltk.download('wordnet')
from textblob import Word
df['cleansentence'] = df['cleansentence'].apply(lambda a: " ".join([Word(word).lemmatize() for word in a.split()]))
df['cleansentence'].head()   #Lemmatization
df.to_csv('movietop.csv')     #Adding cleaned sentence column to csv file
df                  #cleaned sentence

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Movie name,cleansentence
0,0,0,1.The Godfather(1972),the godfather
1,1,1,2.The Silence of the Lambs(1991),the silence lamb
2,2,2,3.Star Wars: Episode V - The Empire Strikes Ba...,star war episode v empire strike back
3,3,3,4.The Shawshank Redemption(1994),the shawshank redemption
4,4,4,5.The Shining(1980),the shining
...,...,...,...,...
495,495,495,"496.Me, Myself & Irene(2000)",me irene
496,496,496,497.The Darjeeling Limited(2007),the darjeeling limited
497,497,497,498.Fear(1996),fear
498,498,498,499.Planet Terror(2007),planet terror


In [None]:
#tf-idf features
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import pandas as pd
import re
sentences = list()
df = pd.read_csv('movietop.csv')
df['Movie name'] = df['Movie name'].str.replace(r'\d+', '', regex=True)       # remove digits
df['Movie name'] = df['Movie name'].str.replace(r'[^\w\s]', '', regex=True)   # remove punctuation
df.to_csv("movietop1.csv", index=False)
with open("movietop1.csv") as file:
    for line in file:
        for l in re.split(r"\.\s|\?\s|\!\s|\n",line):
            if l:
                sentences.append(l)
cvec = CountVectorizer(stop_words='english', min_df=3, max_df=0.5, ngram_range=(1,2))
s = cvec.fit_transform(sentences)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(s)
w = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
# Mean tf-idf weight of each term across all texts, highest first
weights_df = pd.DataFrame({'term': cvec.get_feature_names_out(), 'Result of tf-idf': w})
a = weights_df.sort_values(by='Result of tf-idf', ascending=False)
a.head(500)

Unnamed: 0,term,Result of tf-idf
32,man,0.026411
9,dead,0.012355
0,american,0.00998
7,dark,0.00998
17,good,0.00998
33,men,0.007984
31,love,0.007984
26,life,0.007984
6,city,0.007372
15,game,0.007372
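
The table above holds one averaged weight per term, whereas the exercise asks for one feature row per text. As a hedged illustration of what a per-text representation looks like, the sketch below computes tf-idf by hand with the smoothed idf, ln((1 + n) / (1 + df)) + 1, and L2 normalisation that scikit-learn's `TfidfTransformer` uses by default. The function name `tfidf_rows` is hypothetical, and the sketch skips stop-word removal, `min_df`/`max_df` filtering and bigrams.

```python
import math
from collections import Counter


def tfidf_rows(texts):
    """Return the vocabulary and one L2-normalised tf-idf vector per text."""
    docs = [text.split() for text in texts]
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many texts each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows


# Three cleaned titles taken from the cleaning step's output.
texts = ["the godfather", "the shining", "planet terror"]
vocab, rows = tfidf_rows(texts)
```

Each entry of `rows` lines up with `vocab`, so the pair can be turned into a table with one row per text and one column per term.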


In [None]:
import nltk
nltk.download('punkt')
from nltk import word_tokenize, pos_tag, pos_tag_sents
nltk.download('averaged_perceptron_tagger')
from collections import Counter  # standard library, so no pip install is needed
import pandas as pd
nltk.download('universal_tagset')
# Make sure every entry is a string (read_csv turns empty cells into NaN)
df['cleansentence'] = df['cleansentence'].astype(str)
rows = []
for text in df['cleansentence']:
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens, tagset='universal')
    # Number of tokens per universal POS tag in this text
    counts = dict(Counter(tag for _, tag in tagged))
    counts['Texts'] = text
    rows.append(counts)
# Build the frame from the list of per-text tag counts; missing tags become 0
df1 = pd.DataFrame(rows).fillna(0)
# Length of each cleaned text in characters
df1['sentence length'] = df['cleansentence'].apply(len)
df1

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


Unnamed: 0,DET,NOUN,Texts,ADJ,ADV,NUM,VERB,ADP,PRT,CONJ,PRON,sentence length
0,1.0,1.0,the godfather,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13
1,1.0,2.0,the silence lamb,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16
2,0.0,5.0,star war episode v empire strike back,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,37
3,1.0,2.0,the shawshank redemption,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24
4,1.0,1.0,the shining,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11
...,...,...,...,...,...,...,...,...,...,...,...,...
495,0.0,1.0,me irene,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,8
496,1.0,1.0,the darjeeling limited,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,22
497,0.0,1.0,fear,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
498,0.0,2.0,planet terror,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13
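
The `sentence length` column above counts characters, spaces included. If a token-based length is preferred, one possible variant is sketched below; the helper name `length_features` is an assumption, and tokens are split on whitespace only.

```python
def length_features(text):
    """Character and token length of one cleaned text (whitespace tokenisation)."""
    tokens = text.split()
    return {"chars": len(text), "tokens": len(tokens)}


# Two cleaned titles from the table above.
features = [
    length_features(t)
    for t in ["the godfather", "star war episode v empire strike back"]
]
```

The `chars` values reproduce the 13 and 37 shown in the table, while `tokens` gives 2 and 7 for the same texts.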


In [None]:
#Dependency parsing
import spacy
nlp=spacy.load('en_core_web_sm')
li=[]
for i in range(0,500):
 text1=(df['cleansentence'][i])
 li.append(text1)
sentence = " ".join(li)
for token in nlp(sentence):
 print(token.text,'=>',token.dep_,'=>',token.head.text)


the => det => godfather
godfather => nsubj => strike
the => det => war
silence => compound => war
lamb => compound => star
star => compound => war
war => nsubj => episode
episode => relcl => godfather
v => compound => strike
empire => compound => strike
strike => ROOT => strike
back => advmod => strike
the => det => redemption
shawshank => compound => redemption
redemption => dobj => strike
the => det => casablanca
shining => amod => casablanca
casablanca => nsubj => flew
one => nummod => casablanca
flew => relcl => redemption
cuckoo => compound => raider
nest => compound => raider
indiana => compound => jones
jones => compound => raider
raider => nsubj => lost
lost => ROOT => lost
ark => prep => lost
the => det => ring
lord => compound => ring
ring => nsubj => return
return => compound => war
king => compound => war
star => compound => war
war => nsubj => episode
episode => ROOT => episode
iv => nmod => hope
new => amod => hope
hope => ROOT => hope
the => det => knight
dark => amod =>
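
Because the cell above joins all titles into one string, the parser links words across titles (for example `godfather => nsubj => strike`). To turn a parse into per-text features such as the modification index, each title would need to be parsed on its own. The sketch below assumes that has been done and works on plain `(position, dependency_label, head_position)` tuples, which with spaCy could be read from `token.i`, `token.dep_` and `token.head.i`. The set of labels counted as modifiers and the name `modifier_counts` are assumptions of this sketch.

```python
# Dependency labels counted as modifiers here (an assumption, not a fixed standard).
MODIFIER_DEPS = {"amod", "advmod", "compound", "nummod", "nmod"}


def modifier_counts(tokens):
    """Count modifiers standing before vs. after their head.

    `tokens` is a list of (position, dependency_label, head_position) tuples.
    """
    pre = sum(1 for i, dep, head in tokens if dep in MODIFIER_DEPS and i < head)
    post = sum(1 for i, dep, head in tokens if dep in MODIFIER_DEPS and i > head)
    return {"premodifiers": pre, "postmodifiers": post, "modification_index": pre - post}


# "empire strike back" as labelled in the output above:
# empire -> compound of strike, strike -> ROOT, back -> advmod of strike.
parse = [(0, "compound", 1), (1, "ROOT", 1), (2, "advmod", 1)]
result = modifier_counts(parse)
```

Here one modifier precedes the head and one follows it, so the modification index is 0.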

In [None]:
#Named Entity Recognition
import pandas as pd
#df = pd.read_csv('abst1.csv')
import spacy
import nltk
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()
V=[]
for i in df['cleansentence']:
  for b in nlp(i).ents:
    V.append((b.text,b.label_))
    print([b.text,b.label_])  #extracted entities
#nltk.FreqDist(label for (label,text) in V) 
          


['one', 'CARDINAL']
['indiana', 'GPE']
['seven', 'CARDINAL']
['apocalypse', 'PERSON']
['two', 'CARDINAL']
['north northwest', 'PERSON']
['terug naar de', 'ORG']
['american', 'NORP']
['forrest gump', 'PERSON']
['een zwerftocht de ruimte', 'PERSON']
['da leben der anderen', 'PERSON']
['cidade de', 'PERSON']
['third', 'ORDINAL']
['magnolia', 'NORP']
['de hobbit de woestenij van smaug', 'PERSON']
['mad max fury road', 'PERSON']
['year', 'DATE']
['indiana', 'GPE']
['art thou', 'PERSON']
['lawrence arabia', 'PERSON']
['de verloren zoon', 'PERSON']
['mielensäpahoittaja', 'ORG']
['ben hur', 'PERSON']
['der untergang', 'PERSON']
['gran torino', 'PERSON']
['de maniak', 'PERSON']
['wie bang', 'PERSON']
['virginia', 'GPE']
['de hobbit een onverwachte', 'PERSON']
['avatar', 'ORG']
['de zomer van de witte haai', 'PERSON']
['being john malkovich', 'PERSON']
['jackie brown', 'PERSON']
['american', 'NORP']
['paris', 'GPE']
['first', 'ORDINAL']
['million dollar', 'MONEY']
['harry brown', 'PERSON']
['la 