
## The fifth In-class-exercise (2/23/2021, 20 points in total)

In exercise-03, I asked you to collect 500 text records based on your own information needs (if you didn't collect the data, you should collect it again for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features from the following feature list (10 feature types, which will yield well over 10 individual features), then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjective, adverb, auxiliary, punctuation, complementizer, coordinating conjunction, subordinating conjunction, determiner, interjection, noun, possessor, preposition, pronoun, quantifier, verb, and other. (select some of them if you use pos-tag features)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density of determiners/quantifiers (density can be calculated as the ratio of the following function words to content words)
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features you have in mind (e.g., pre-defined patterns)
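
As a rough illustration of the density features in the list, the sketch below computes the ratio of one group of function words to content words from tokens that are already POS-tagged. The helper name `density`, the tag groupings in `FUNCTION_TAGS`, and the choice to treat nouns, verbs, adjectives, and adverbs as content words are assumptions made for illustration, not part of the assignment.

```python
# Hypothetical helper: density of a function-word class relative to content words.
# Input is a list of (token, Penn Treebank tag) pairs, e.g. as returned by nltk.pos_tag.

# Assumption: content words are nouns, verbs, adjectives, and adverbs.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")

# Assumption: illustrative tag groups for a few function-word classes.
FUNCTION_TAGS = {
    "determiner": {"DT", "PDT", "WDT"},
    "pronoun": {"PRP", "PRP$", "WP", "WP$"},
    "preposition": {"IN"},  # IN also covers subordinating conjunctions
    "conjunction": {"CC"},
}


def density(tagged, function_class):
    """Return (# function words of the class) / (# content words), or 0.0 if none."""
    targets = FUNCTION_TAGS[function_class]
    n_function = sum(1 for _, tag in tagged if tag in targets)
    n_content = sum(1 for _, tag in tagged if tag.startswith(CONTENT_PREFIXES))
    return n_function / n_content if n_content else 0.0


tagged_example = [
    ("The", "DT"), ("vaccine", "NN"), ("works", "VBZ"),
    ("and", "CC"), ("it", "PRP"), ("is", "VBZ"), ("safe", "JJ"),
    ("for", "IN"), ("the", "DT"), ("elderly", "NN"),
]
print(density(tagged_example, "determiner"))  # 2 determiners / 5 content words
```

Note that these densities are only meaningful when tagging runs on text that still contains its function words, that is, before stop-word removal.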

## Data pre-processing

In [1]:
import tweepy as tw
import pandas as pd

In [2]:
def scrape_twitter(consumer_keys, consumer_secret, access_token, access_token_secret):
    # Authenticate against the Twitter API with the user's credentials.
    auth = tw.OAuthHandler(consumer_keys, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tw.API(auth)

    query = '#Covid'
    count = 500

    # Search English tweets matching the hashtag, excluding retweets.
    # Note: tweepy 4.0 renamed api.search to api.search_tweets.
    tweets = tw.Cursor(api.search, q=query + " -filter:retweets", lang="en").items(count)
    temp = [tweet.text for tweet in tweets]

    data = pd.DataFrame({'Original': temp})
    return data

In [None]:
consumer_keys=input("Enter the consumer key")
consumer_secret=input("Enter the consumer_secret")
access_token=input("Enter the access_token")
access_token_secret=input("Enter the access_token_secret")

In [4]:
data = scrape_twitter(consumer_keys, consumer_secret, access_token, access_token_secret)


In [5]:
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import TextBlob, Word

nltk.download('stopwords')
stop = stopwords.words('english')
nltk.download('punkt')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [6]:
data['Modified']=data['Original']


In [9]:
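# Noise: strip punctuation and other non-word characters.
# regex=True is passed explicitly because pandas 2.0+ treats the pattern as a literal by default.
data['Modified'] = data['Modified'].str.replace(r'[^\w\s]+', '', regex=True)

# Numbers: drop purely numeric tokens.
data['Modified'] = data['Modified'].apply(lambda x: " ".join(w for w in x.split() if not w.isdigit()))

# Stopword removal. The NLTK list is lower-case, so capitalized stop words
# (e.g. "The") are not removed when this step runs before lower casing.
data['Modified'] = data['Modified'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))

# Lower casing
data['Modified'] = data['Modified'].apply(lambda x: " ".join(w.lower() for w in x.split()))

# Stemming
stemmer = PorterStemmer()
data['Modified'] = data['Modified'].apply(lambda x: " ".join(stemmer.stem(w) for w in x.split()))

# Lemmatization (applied to the stems produced above)
data['Modified'] = data['Modified'].apply(lambda x: " ".join(Word(w).lemmatize() for w in x.split()))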
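
The order of the steps above matters: NLTK's English stop-word list is lower-case, so filtering before lower casing leaves capitalized stop words such as "The" in place. The sketch below shows the effect on a toy sentence using only the standard library. The tiny `STOP` set and the two helper functions are hypothetical stand-ins for the NLTK list and the pandas pipeline.

```python
import re

# Hypothetical miniature stop list standing in for stopwords.words('english').
STOP = {"the", "is", "a", "of", "in"}


def strip_punct(text):
    # Same pattern as the notebook: drop anything that is not a word character or whitespace.
    return re.sub(r"[^\w\s]+", "", text)


def stop_then_lower(text):
    # Notebook order: filter stop words first, lower-case afterwards.
    kept = [w for w in strip_punct(text).split() if w not in STOP]
    return " ".join(w.lower() for w in kept)


def lower_then_stop(text):
    # Alternative order: lower-case first so capitalized stop words are caught too.
    words = strip_punct(text).lower().split()
    return " ".join(w for w in words if w not in STOP)


sample = "The vaccine is safe, the CDC says!"
print(stop_then_lower(sample))  # the vaccine safe cdc says
print(lower_then_stop(sample))  # vaccine safe cdc says
```

Whether the difference matters depends on the downstream features; for TF-IDF it mainly adds a few high-frequency, low-information terms to the vocabulary.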

In [10]:
processed = pd.DataFrame(data=data['Modified'])

In [11]:
processed['Modified']

0      look job get touch u amp find latest job get n...
1      diver zoonot batborn viru respon disea human c...
2      differ covid vaccin handl new variant viru bre...
3      friend plan trip never happen becauseofcoronav...
4      bugger think im allerg vaccin covid httpstcoxl...
                             ...                        
495    opseu cec urg provinc vaccin frontlin colleg f...
496    omg new covid case highest far peopl better ht...
497    know exactli need hear youv vaccin doesnt mean...
498    covid teach tip mail lab kit student result su...
499    sen_joemanchindncjoebidenpotu sen manchin tri ...
Name: Modified, Length: 500, dtype: object

## Extract Features

In [15]:
from nltk import word_tokenize, pos_tag
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [16]:
nouns=[]
verbs=[]
adjective=[]
adverbs=[]
coordinate=[]
subordinate=[]
determine=[]
sentsize=[]
pronoun=[]
interjection =[]

for i in range(len(processed)):
    words = word_tokenize(processed['Modified'][i])
    tag = pos_tag(words)
    
    n=0
    v=0
    adj=0
    adv=0
    coord=0
    sub=0
    det=0
    pro=0
    inj=0

    
    # Count tokens per coarse category using Penn Treebank tags.
    for _, pos in tag:
        if pos in ('NN', 'NNS', 'NNP', 'NNPS'):
            n += 1
        elif pos in ('VB', 'VBG', 'VBD', 'VBN', 'VBP', 'VBZ'):
            v += 1
        elif pos in ('JJ', 'JJR', 'JJS'):
            adj += 1
        elif pos in ('RB', 'RBR', 'RBS'):
            adv += 1
        elif pos == 'CC':
            coord += 1
        elif pos == 'IN':  # IN covers prepositions as well as subordinating conjunctions
            sub += 1
        elif pos == 'DT':
            det += 1
        elif pos in ('PRP', 'PRP$', 'WP'):
            pro += 1
        elif pos == 'UH':
            inj += 1

        
    sentsize.append(len(tag))        
    nouns.append(n)
    verbs.append(v)
    adjective.append(adj)
    adverbs.append(adv)
    coordinate.append(coord)
    subordinate.append(sub)
    determine.append(det)
    pronoun.append(pro)
    interjection.append(inj)
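
The counting loop above can also be written by mapping each Penn Treebank tag to a coarse category and tallying with `collections.Counter`. This is a sketch on pre-tagged input rather than a drop-in replacement: the names `CATEGORY_TAGS`, `TAG_TO_CATEGORY`, and `count_pos` are illustrative, and the grouping mirrors the notebook's (including counting `IN` as a subordinating conjunction, although that tag also covers prepositions).

```python
from collections import Counter

# Illustrative mapping from the notebook's feature columns to Penn Treebank tags.
CATEGORY_TAGS = {
    "Nouns": ("NN", "NNS", "NNP", "NNPS"),
    "Verbs": ("VB", "VBG", "VBD", "VBN", "VBP", "VBZ"),
    "Adjective": ("JJ", "JJR", "JJS"),
    "Adverbs": ("RB", "RBR", "RBS"),
    "Coordinating_Conjunction": ("CC",),
    "Subordinating_Conjunction": ("IN",),  # IN also tags prepositions
    "Determiner": ("DT",),
    "Pronoun": ("PRP", "PRP$", "WP"),
    "Interjection": ("UH",),
}
# Invert to tag -> category for constant-time lookup.
TAG_TO_CATEGORY = {tag: cat for cat, tags in CATEGORY_TAGS.items() for tag in tags}


def count_pos(tagged):
    """Return one feature row (a dict) for a list of (token, tag) pairs."""
    counts = Counter(TAG_TO_CATEGORY[tag] for _, tag in tagged if tag in TAG_TO_CATEGORY)
    row = {cat: counts.get(cat, 0) for cat in CATEGORY_TAGS}
    row["Sentence_Length"] = len(tagged)  # all tokens, including uncategorized ones
    return row


tagged_example = [
    ("oh", "UH"), ("they", "PRP"), ("quickly", "RB"), ("approved", "VBD"),
    ("the", "DT"), ("new", "JJ"), ("vaccine", "NN"), ("and", "CC"),
    ("masks", "NNS"), (".", "."),
]
print(count_pos(tagged_example))
```

A list of such rows can be passed to `pd.DataFrame` to build all count columns in one step.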
    

## TF-IDF features

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(processed['Modified'])

In [20]:
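# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_names = v.get_feature_names_out()
den = x.toarray()  # dense 500 x vocabulary-size array of TF-IDF weights
denlist = den.tolist()
tf_weights = pd.DataFrame(denlist, columns=feature_names)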
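
To make the weights in `tf_weights` easier to interpret, here is a pure-Python sketch of what `TfidfVectorizer` computes with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The function name `tfidf_rows` and the toy documents are hypothetical, and tokenization is simplified to whitespace splitting, so results can differ from scikit-learn on text with one-character tokens or punctuation.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized smoothed TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf, matching scikit-learn's smooth_idf=True default.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


docs = ["covid vaccin new", "covid case new case", "zoom work"]
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Terms that occur in fewer documents receive a larger idf, which is why rare tokens dominate the non-zero entries of a sparse matrix like the one above.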

In [21]:
tf_weights

Unnamed: 0,10newsfirstp,2021blc,24mil,260k,26m,2immun,2liter,2nd,2vaccin,3week,5100th,5th,68pm,7mil,906ami,9th,abc,abcnew,abcscienc,abil,absolut,accept,access,accidentalp,accord,account,acct,acknowledg,across,act,action,activ,actual,adam,adapt,addit,adhanom,adjust,administr,adopt,...,wor,word,work,worker,world,worri,worst,worth,would,wound,write,wt,wtf,wuhan,wupdat,wv,ya,yall,ye,year,yeeeeesssssss,yeg,yellow,yesterday,yet,yo,yoder_esq,york,youll,youtub,youv,yr,zhan,zhang,zonal,zoom,zoonot,zweden,ニジカノ,日本
0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
1,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.384416,0.0,0.0,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
4,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.242926,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
496,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
497,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.356149,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
498,0.0,0.297652,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0


## Features

In [22]:
processed['Nouns']=nouns
processed['Verbs']=verbs
processed['Adjective']=adjective
processed['Adverbs']=adverbs
processed['Coordinating_Conjunction']=coordinate
processed['Subordinating_Conjunction']=subordinate
processed['Determiner']=determine
processed['Sentence_Length']=sentsize
processed['Pronoun']=pronoun
processed['Interjection']=interjection

In [23]:
processed

| | Modified | Nouns | Verbs | Adjective | Adverbs | Coordinating_Conjunction | Subordinating_Conjunction | Determiner | Sentence_Length | Pronoun | Interjection |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | look job get touch u amp find latest job get n... | 8 | 3 | 5 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
| 1 | diver zoonot batborn viru respon disea human c... | 7 | 0 | 2 | 1 | 0 | 0 | 0 | 10 | 0 | 0 |
| 2 | differ covid vaccin handl new variant viru bre... | 9 | 1 | 3 | 0 | 0 | 0 | 0 | 13 | 0 | 0 |
| 3 | friend plan trip never happen becauseofcoronav... | 7 | 1 | 0 | 1 | 0 | 0 | 0 | 9 | 0 | 0 |
| 4 | bugger think im allerg vaccin covid httpstcoxl... | 5 | 1 | 1 | 0 | 0 | 0 | 0 | 7 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | opseu cec urg provinc vaccin frontlin colleg f... | 9 | 0 | 3 | 0 | 0 | 0 | 0 | 12 | 0 | 0 |
| 496 | omg new covid case highest far peopl better ht... | 3 | 0 | 4 | 2 | 0 | 0 | 0 | 9 | 0 | 0 |
| 497 | know exactli need hear youv vaccin doesnt mean... | 5 | 3 | 3 | 1 | 0 | 0 | 0 | 12 | 0 | 0 |
| 498 | covid teach tip mail lab kit student result su... | 7 | 0 | 3 | 0 | 0 | 0 | 0 | 12 | 0 | 0 |
