# Natural Language Processing (NLP)
- Lexical Processing (Lexicon - Words or Tokens) - Today's Agenda
- Syntactic Processing (Syntax/Grammar)
- Semantic Processing (Context/Meaning)

## Lexical Processing
- Text Preprocessing - Tokenisation, Stopword Removal, Stemming/Lemmatization
- Vectorisation - TF-IDF, Bag of Words (BoW), etc.
- Application - Text Classification (Spam/Ham)

In [3]:
import nltk
# Natural Language Toolkit

In [5]:
#!pip install nltk

## Tokenisation

In [8]:
document = "At nine o'clock, I visited him myself. It looks like religious mania, and he'll soon think that he himself is God."
print(document)

At nine o'clock, I visited him myself. It looks like religious mania, and he'll soon think that he himself is God.


In [10]:
print(document.split())
# Rule Based Approach

['At', 'nine', "o'clock,", 'I', 'visited', 'him', 'myself.', 'It', 'looks', 'like', 'religious', 'mania,', 'and', "he'll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God.']


In [12]:
nltk.download('punkt')  # model
# Pre-trained Punkt model that NLTK uses for tokenisation.
# Newer NLTK releases may ask for the 'punkt_tab' resource instead.

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/shivamgarg/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
from nltk.tokenize import word_tokenize #code
words = word_tokenize(document)
print(words)

['At', 'nine', "o'clock", ',', 'I', 'visited', 'him', 'myself', '.', 'It', 'looks', 'like', 'religious', 'mania', ',', 'and', 'he', "'ll", 'soon', 'think', 'that', 'he', 'himself', 'is', 'God', '.']
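Between `str.split()` and the pre-trained NLTK tokeniser there is a middle ground: a hand-written regular expression. The sketch below is only an illustration; the pattern and the name `regex_tokenize` are assumptions, not part of NLTK. It separates punctuation from words like `word_tokenize` does, but it keeps contractions such as "he'll" as a single token, whereas NLTK splits them.

```python
import re

document = ("At nine o'clock, I visited him myself. It looks like religious mania, "
            "and he'll soon think that he himself is God.")

# Hypothetical pattern: a run of word characters (optionally followed by an
# apostrophe and more word characters), or any single character that is
# neither a word character nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    """Return the list of tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

regex_tokens = regex_tokenize(document)
print(regex_tokens)

# Compare the number of tokens from whitespace splitting and from the regex.
print(len(document.split()), len(regex_tokens))
```

A fixed pattern like this is easy to reason about, but every new edge case (abbreviations, URLs, emoji) needs another rule, which is the usual argument for a trained tokeniser.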


In [16]:
from nltk.tokenize import sent_tokenize  # code
sentences = sent_tokenize(document)  # split the document into sentences
print(sentences)

["At nine o'clock, I visited him myself.", "It looks like religious mania, and he'll soon think that he himself is God."]


## StopWords Removal

In [19]:
nltk.download("stopwords")# data

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shivamgarg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
from nltk.corpus import stopwords # code

In [23]:
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [25]:
ls=stopwords.words('english')
len(ls)

198

In [27]:
sent="Natural language processing is that any computation and manipulation of natural language to get inside about how words mean and how sentences are contructed is natural language processing."
print(sent)

Natural language processing is that any computation and manipulation of natural language to get inside about how words mean and how sentences are contructed is natural language processing.


In [29]:
words=word_tokenize(sent)
print(words)

['Natural', 'language', 'processing', 'is', 'that', 'any', 'computation', 'and', 'manipulation', 'of', 'natural', 'language', 'to', 'get', 'inside', 'about', 'how', 'words', 'mean', 'and', 'how', 'sentences', 'are', 'contructed', 'is', 'natural', 'language', 'processing', '.']


In [31]:
words_after_stopwords=[word for word in words if word.lower() not in ls]
print("--------------")
print(words_after_stopwords)

--------------
['Natural', 'language', 'processing', 'computation', 'manipulation', 'natural', 'language', 'get', 'inside', 'words', 'mean', 'sentences', 'contructed', 'natural', 'language', 'processing', '.']


In [33]:
len(words),len(words_after_stopwords)

(29, 17)
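The filtering step can be sketched without NLTK to make the mechanics visible. The token list is copied from the `word_tokenize` output above; `stop_set` is a hand-picked subset of the English stop-word list (an assumption, just large enough for this sentence), and `remove_stopwords` is a hypothetical helper name. A `set` is used because membership tests on a set do not scan the whole collection the way they do on a list.

```python
# Tokens copied from the word_tokenize output above.
tokens = ['Natural', 'language', 'processing', 'is', 'that', 'any', 'computation',
          'and', 'manipulation', 'of', 'natural', 'language', 'to', 'get', 'inside',
          'about', 'how', 'words', 'mean', 'and', 'how', 'sentences', 'are',
          'contructed', 'is', 'natural', 'language', 'processing', '.']

# Assumption: a small subset of the NLTK English stop words, enough for this sentence.
stop_set = {'is', 'that', 'any', 'and', 'of', 'to', 'about', 'how', 'are'}

def remove_stopwords(tokens, stop_set):
    """Keep tokens whose lowercase form is not a stop word (original case is preserved)."""
    return [token for token in tokens if token.lower() not in stop_set]

filtered = remove_stopwords(tokens, stop_set)
print(filtered)
print(len(tokens), len(filtered))
```

Note that the full stop survives the filter: punctuation is not a stop word, so it has to be removed in a separate step if it is unwanted.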

In [37]:
stopwords.words('spanish')

['de',
 'la',
 'que',
 'el',
 'en',
 'y',
 'a',
 'los',
 'del',
 'se',
 'las',
 'por',
 'un',
 'para',
 'con',
 'no',
 'una',
 'su',
 'al',
 'lo',
 'como',
 'más',
 'pero',
 'sus',
 'le',
 'ya',
 'o',
 'este',
 'sí',
 'porque',
 'esta',
 'entre',
 'cuando',
 'muy',
 'sin',
 'sobre',
 'también',
 'me',
 'hasta',
 'hay',
 'donde',
 'quien',
 'desde',
 'todo',
 'nos',
 'durante',
 'todos',
 'uno',
 'les',
 'ni',
 'contra',
 'otros',
 'ese',
 'eso',
 'ante',
 'ellos',
 'e',
 'esto',
 'mí',
 'antes',
 'algunos',
 'qué',
 'unos',
 'yo',
 'otro',
 'otras',
 'otra',
 'él',
 'tanto',
 'esa',
 'estos',
 'mucho',
 'quienes',
 'nada',
 'muchos',
 'cual',
 'poco',
 'ella',
 'estar',
 'estas',
 'algunas',
 'algo',
 'nosotros',
 'mi',
 'mis',
 'tú',
 'te',
 'ti',
 'tu',
 'tus',
 'ellas',
 'nosotras',
 'vosotros',
 'vosotras',
 'os',
 'mío',
 'mía',
 'míos',
 'mías',
 'tuyo',
 'tuya',
 'tuyos',
 'tuyas',
 'suyo',
 'suya',
 'suyos',
 'suyas',
 'nuestro',
 'nuestra',
 'nuestros',
 'nuestras',
 'vuestro'

### Lemmatization & Stemming

Stemming:
1. It is a rule-based approach that strips suffixes from words, so it can produce stems that are not valid English words (for example, "comput").
2. It is a very fast approach.

In [41]:
sent = "task tasked tasks tasking keys mangoes computing looking"

In [43]:
words=word_tokenize(sent)
words_after_stopwords=[word for word in words if word.lower() not in ls]
print(words_after_stopwords)

['task', 'tasked', 'tasks', 'tasking', 'keys', 'mangoes', 'computing', 'looking']


In [45]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [47]:
words_after_stemming=[stemmer.stem(word) for word in words_after_stopwords]
print(words_after_stemming)

['task', 'task', 'task', 'task', 'key', 'mango', 'comput', 'look']
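To show what "rule-based" means in practice, here is a toy suffix stripper. It is not the Porter algorithm, which applies several ordered phases of rules with extra conditions; the suffix list, the `min_stem_len` guard and the name `toy_stem` are all assumptions made for this illustration.

```python
# Hypothetical rule set: suffixes are tried in this order and the first match wins.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided at least min_stem_len characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

words = "task tasked tasks tasking keys mangoes computing looking".split()
toy_stems = [toy_stem(word) for word in words]
print(toy_stems)

# Blind rules also over-stem: "news" loses its final "s".
print(toy_stem("news"))
```

On these eight words the toy rules happen to agree with the Porter output above, including the non-word "comput". The "news" case shows why real stemmers need more careful conditions.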


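Lemmatization:
1. It is a dictionary-based approach (WordNet here), so it returns valid words. It is also conservative: `lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as "tasked" are left unchanged.
2. It is slower than stemming.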

In [50]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/shivamgarg/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [52]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [54]:
words_after_lemmatization=[wordnet_lemmatizer.lemmatize(word) for word in words_after_stopwords]
print(words)
print(words_after_lemmatization)

['task', 'tasked', 'tasks', 'tasking', 'keys', 'mangoes', 'computing', 'looking']
['task', 'tasked', 'task', 'tasking', 'key', 'mango', 'computing', 'looking']
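The output above leaves "tasked", "tasking", "computing" and "looking" untouched because the lookup is done with the default noun part of speech. The sketch below imitates that behaviour with a tiny hand-written table. The table and the name `toy_lemmatize` are hypothetical and stand in for WordNet only for the purpose of illustration.

```python
# Hypothetical mini-dictionary keyed by (surface form, part of speech).
LEMMA_TABLE = {
    ("tasks", "n"): "task",
    ("keys", "n"): "key",
    ("mangoes", "n"): "mango",
    ("tasked", "v"): "task",
    ("tasking", "v"): "task",
    ("computing", "v"): "compute",
    ("looking", "v"): "look",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; return it unchanged if absent."""
    return LEMMA_TABLE.get((word, pos), word)

words = "task tasked tasks tasking keys mangoes computing looking".split()

as_nouns = [toy_lemmatize(word) for word in words]           # default: noun lookup
as_verbs = [toy_lemmatize(word, pos="v") for word in words]  # verb lookup
print(as_nouns)
print(as_verbs)
```

With NLTK the same idea applies: `WordNetLemmatizer.lemmatize` takes a `pos` argument, so passing the right tag (usually from a part-of-speech tagger) is what lets verb forms reduce to their base form.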


### Vectorisation

In [57]:
from sklearn.feature_extraction.text import TfidfVectorizer

#### TF-IDF --> Vectorisation Method
- TF --> Term Frequency: number of times the word occurs in the document / total number of words in that document
- IDF --> Inverse Document Frequency: log(total number of documents / number of documents that contain the word). scikit-learn's `TfidfVectorizer` uses a smoothed variant of this and L2-normalises each row.
- The higher the TF and the IDF, the more important the word is to that document, and vice versa

In [60]:
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "The dog is not lazy"
]

In [62]:
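# TfidfVectorizer lowercases and tokenises the text itself; stop_words='english' also drops
# scikit-learn's built-in English stop words. It does not stem, so 'jump' and 'jumps' stay separate.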
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

In [64]:
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

In [66]:
feature_names = tfidf_vectorizer.get_feature_names_out()
feature_names

array(['brown', 'dog', 'fox', 'jump', 'jumps', 'lazy', 'quick', 'quickly'],
      dtype=object)

In [68]:
import pandas as pd
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

In [70]:
tfidf_df

|   | brown | dog | fox | jump | jumps | lazy | quick | quickly |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.461381 | 0.272499 | 0.461381 | 0.0 | 0.461381 | 0.272499 | 0.461381 | 0.0 |
| 1 | 0.0 | 0.359594 | 0.0 | 0.608845 | 0.0 | 0.359594 | 0.0 | 0.608845 |
| 2 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 |
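The values in the table above can be reproduced by hand, which makes the weighting concrete. The sketch assumes scikit-learn's default settings: raw term counts as TF, the smoothed IDF `ln((1 + N) / (1 + df)) + 1`, and L2 normalisation of each row. The token lists in `docs` are an assumption read off the feature names and non-zero cells (the words left after lowercasing and stop-word removal), and `tfidf_row` is a hypothetical helper name.

```python
import math

# Assumption: tokens remaining in each document after lowercasing and stop-word removal.
docs = [
    ["quick", "brown", "fox", "jumps", "lazy", "dog"],
    ["jump", "lazy", "dog", "quickly"],
    ["dog", "lazy"],
]

vocab = sorted({term for doc in docs for term in doc})  # alphabetical, like the feature names
n_docs = len(docs)

# Document frequency and smoothed IDF: ln((1 + N) / (1 + df)) + 1
df = {term: sum(term in doc for doc in docs) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Raw count times IDF for every vocabulary term, then L2-normalise the row."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

rows = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in rows:
    print([round(value, 6) for value in row])
```

"dog" and "lazy" occur in all three documents, so their IDF is the minimum value of 1 and they receive the smallest non-zero weights in the first two rows. Dividing the counts by document length, as in the TF definition above, would not change the result, because the L2 normalisation rescales each row anyway.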


"The quick brown fox jumps over the lazy dog" --> [0.461381,0.272499,0.461381,0.000000,0.461381,0.272499,0.461381,0.000000]

# Text Classification
* Classify Twitter users as male or female from the text in the `description` column
* Uses the "Twitter User Gender Classification" dataset

In [73]:
data = pd.read_csv(r"gender-classifier-DFE-791531.csv",encoding='latin1')
data = pd.concat([data.gender,data.description],axis=1)
#https://drive.google.com/file/d/10-YHdBzry9hMdrM5ct2JXvQmJN0cm7Fo/view?usp=drive_link

In [77]:
data.head(2)

|   | gender | description |
|---|---|---|
| 0 | male | i sing my own rhythm. |
| 1 | male | I'm the author of novels filled with family dr... |


In [79]:
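# keep only rows labelled female or male; copy so later in-place edits do not act on a view
data = data[data["gender"].isin(["female", "male"])].copy()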

In [81]:
# drop nan values
data.dropna(inplace=True,axis=0)

In [83]:
data.shape

(11194, 2)

In [85]:
# convert genders from female and male to 1 and 0 respectively
data.gender = [1 if each == "female" else 0 for each in data.gender] 

In [87]:
data.head(2)

|   | gender | description |
|---|---|---|
| 0 | 0 | i sing my own rhythm. |
| 1 | 0 | I'm the author of novels filled with family dr... |


In [89]:
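stop_words = set(stopwords.words('english'))  # build once; set membership tests are fast

def preprocessing(sent):
    """Tokenise, drop English stop words, stem, and re-join into one string."""
    words = word_tokenize(sent)  # tokenisation
    words_after_stopwords = [word for word in words if word.lower() not in stop_words]  # stopword removal
    words_after_stemming = [stemmer.stem(word) for word in words_after_stopwords]  # stemming
    return " ".join(words_after_stemming)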

In [91]:
%%time
data["description"]=data["description"].apply(preprocessing)

CPU times: user 6.97 s, sys: 1.78 s, total: 8.75 s
Wall time: 8.98 s


In [93]:
data.head(2)

|   | gender | description |
|---|---|---|
| 0 | 0 | sing rhythm . |
| 1 | 0 | 'm author novel fill famili drama romanc . |


In [105]:
# max_features: keep only the 200 most frequent terms; min_df: ignore terms found in fewer than 10 documents
vect = TfidfVectorizer(max_features=200, min_df=10)
X=vect.fit_transform(data.description)
feature_names = vect.get_feature_names_out()
X = pd.DataFrame(X.toarray(), columns=feature_names)
X.head(2)

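|   | 10 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 2015 | ... | work | world | write | writer | year | youtub | ªá | êû | ï_ | ïî |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |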


In [107]:
y = data.iloc[:,0].values

In [109]:
# train test split
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X,y,test_size = 0.3,random_state = 0)

In [117]:
# naive bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [113]:
nb = MultinomialNB()
nb.fit(x_train,y_train)

In [115]:
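y_pred_train = nb.predict(x_train)  # predictions on the training split
y_pred_test = nb.predict(x_test)  # predictions on the held-out test split

In [119]:
print(round(accuracy_score(y_train, y_pred_train) * 100, 1), "%")  # train accuracy
print(round(accuracy_score(y_test, y_pred_test) * 100, 1), "%")  # test accuracy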

63.4 %
61.1 %
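An accuracy figure is easier to judge next to a trivial baseline, such as always predicting the most common training label. The sketch below shows how both numbers are computed. The label lists are made up for illustration and are not taken from the Twitter data, and `accuracy` and `majority_baseline_accuracy` are hypothetical helper names.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(true == pred for true, pred in zip(y_true, y_pred))
    return matches / len(y_true)

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy on y_test of always predicting the most frequent label in y_train."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority_label] * len(y_test))

# Hypothetical labels, for illustration only.
toy_train = [0, 0, 0, 1, 1]
toy_test = [0, 1, 0, 1, 0]
toy_pred = [0, 1, 1, 1, 0]

model_acc = accuracy(toy_test, toy_pred)
baseline_acc = majority_baseline_accuracy(toy_train, toy_test)
print(model_acc, baseline_acc)
```

Applied to the real split, the same comparison would use `y_train`, `y_test` and `y_pred_test` from the cells above; a model is only useful to the extent that it beats this baseline.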
