Importing Relevant Modules

### Working with Natural Language Processing in Python Using NLTK Tools

In [10]:
import pandas as pd
import numpy as np
from numpy.random import randint
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix as cfm
from sklearn.metrics import f1_score, r2_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, OneHotEncoder
from sklearn import tree, metrics

### Brief write-ups on NLTK

##### APIs for streaming data:
The Twitter API, Reddit API, and News API are examples of services that provide streaming data. Their data can be retrieved and processed in Python with tools such as the Natural Language Toolkit and BeautifulSoup, and streamed into a SQL database with the mysql-connector module.
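
As a rough illustration of the storage step, the sketch below writes a few article records to a SQL table and reads them back. It is only a sketch under stated assumptions: the in-memory `sqlite3` database from the standard library stands in for a MySQL server, and the `news` table and the example rows are hypothetical, shaped like the `source`, `title`, `description` columns used later in this notebook.

```python
import sqlite3

# Hypothetical rows in the shape an API client might return:
# (source, title, description).
articles = [
    ("Example Source", "Example headline one", "Short description one"),
    ("Example Source", "Example headline two", "Short description two"),
]

# An in-memory SQLite database stands in for a MySQL server here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (source TEXT, title TEXT, description TEXT)")

# Parameterized inserts keep the article text out of the SQL string itself.
conn.executemany(
    "INSERT INTO news (source, title, description) VALUES (?, ?, ?)",
    articles,
)
conn.commit()

rows = conn.execute("SELECT source, title FROM news ORDER BY title").fetchall()
print(rows)
conn.close()
```

With mysql-connector the overall flow (connect, execute, commit) is similar, but the connection arguments and the placeholder style (`%s` instead of `?`) differ.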

##### About NLTK (Natural Language ToolKit)
NLTK stands for Natural Language Toolkit. It provides a large collection of corpora (bodies of text) to work with. Supervised learning algorithms require labelled data, so the text used to train a model must also be tagged.

NLTK has built-in methods that give easy, intuitive access to corpora (large bodies of text; the plural of corpus). Once the corpora are downloaded, they can be accessed through NLTK's corpus module.

##### Tokenization
Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens.
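
The idea can be sketched with a small regular-expression tokenizer. This is a simplified stand-in written for illustration, not NLTK's `word_tokenize`; the function name `simple_tokenize` and the pattern are assumptions of this sketch.

```python
import re

# A run of word characters (letters, digits, underscore) is one token;
# any other non-space character is a token on its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Trump calls for deal, now!")
print(tokens)  # ['Trump', 'calls', 'for', 'deal', ',', 'now', '!']
```

A pattern this simple splits abbreviations such as "U.S." into several tokens, which is one reason to prefer a library tokenizer for real text.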

##### Stop Words
Stop words are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. These words carry little meaning on their own and are usually removed from texts. For some applications, such as document classification, it may make sense to remove stop words.
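
A minimal sketch of stop-word removal follows. The small `STOP_WORDS` set is a hypothetical stand-in for NLTK's much longer English list, and `remove_stop_words` is a helper named for this example only. The text is lowercased before filtering so that capitalized stop words such as "The" are also caught.

```python
import string

# Hypothetical mini stop list for illustration; NLTK's English list is longer.
STOP_WORDS = {"the", "a", "on", "is", "all", "to", "of"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, strip surrounding punctuation, and drop stop words."""
    kept = []
    for word in text.lower().split():
        word = word.strip(string.punctuation)
        if word and word not in stop_words:
            kept.append(word)
    return " ".join(kept)


cleaned = remove_stop_words("The head of a conservative group is on the move.")
print(cleaned)  # head conservative group move
```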


### PART 3

Analyzing the True and Fake News Dataset

In [2]:
true = pd.read_csv('TrueNews.csv')
fake = pd.read_csv('FakeNews.csv')

In [3]:
true['label'] = 1 # label true news as 1
fake['label'] = 0 # label fake news as 0

In [4]:
df = pd.concat([true,fake]) # Concatenate the true and fake datasets

In [5]:
df.drop('date',axis=1,inplace=True) # Drop date because it's not needed for analysis

In [6]:
df.drop('subject',axis=1,inplace=True) # Drop subject because it's not needed for analysis

In [7]:
df.head()

Unnamed: 0,title,text,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,1


In [8]:
df.shape

(44889, 3)

In [11]:
# Importing NLTK Libraries

import nltk
import string
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from scipy.spatial.distance import pdist,squareform
from nltk.corpus import words,names,brown

In [12]:
# Requires the stopwords corpus, e.g. via nltk.download('stopwords')
stop = set(nltk.corpus.stopwords.words('english'))

In [13]:
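# Cleaning the title column: drop stop words and standalone punctuation tokens.
# The stop list is lowercase, so capitalized words such as "As" or "The" are kept here;
# the text is only lowercased in a later cell.
df['titleCleaned'] = df.title.apply(lambda x: ' '.join([word for word in x.split() if word not in stop and 
                                                          word not in string.punctuation]))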

In [14]:
# Cleaning the text column in the same way
df['textCleaned'] = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop and 
                                                          word not in string.punctuation]))

In [15]:
# converting to lower case
df['textCleaned'] = df.textCleaned.str.lower()
df['titleCleaned'] = df.titleCleaned.str.lower()

In [16]:
df.drop(columns =['title','text'],inplace=True)

In [17]:
df.head()

Unnamed: 0,label,titleCleaned,textCleaned
0,1,"as u.s. budget fight looms, republicans flip f...",washington (reuters) the head conservative rep...
1,1,u.s. military accept transgender recruits mond...,washington (reuters) transgender people allowe...
2,1,senior u.s. republican senator: 'let mr. muell...,washington (reuters) the special counsel inves...
3,1,fbi russia probe helped australian diplomat ti...,washington (reuters) trump campaign adviser ge...
4,1,trump wants postal service charge 'much more' ...,seattle/washington (reuters) president donald ...


In [18]:
x1 = df.drop('label',axis=1) 
y = df.label

In [19]:
ordinal = OrdinalEncoder()

In [20]:
x = pd.DataFrame(ordinal.fit_transform(x1),columns=x1.columns)
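
To make the effect of this step concrete, the pure-Python sketch below mimics what ordinal encoding does to a column of strings under its default settings: the distinct values are sorted and each one is replaced by its position in that sorted list. The helper `ordinal_encode` and the sample titles are hypothetical, and scikit-learn's encoder returns floats rather than the integers shown here.

```python
def ordinal_encode(values):
    """Map each distinct string to its index in the sorted distinct values."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories


titles = [
    "trump calls deal",
    "fbi russia probe",
    "trump calls deal",
    "u.s. military accept",
]
codes, categories = ordinal_encode(titles)
print(codes)       # [1, 0, 1, 2]
print(categories)  # ['fbi russia probe', 'trump calls deal', 'u.s. military accept']
```

Because almost every cleaned title and article body is a distinct string, nearly every row receives its own code, and that code reflects alphabetical order rather than the content of the text.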

In [21]:
X_train,X_test,y_train,y_test = train_test_split(x,y, random_state=42)

In [22]:
print('X_train.shape %s, X_test.shape %s, y_train.shape %s, y_test.shape %s' %(X_train.shape,X_test.shape,y_train.shape,y_test.shape))

X_train.shape (33666, 2), X_test.shape (11223, 2), y_train.shape (33666,), y_test.shape (11223,)


In [23]:
clf = tree.DecisionTreeClassifier(max_depth=3)

In [24]:
clf.fit(X_train,y_train)
pred = clf.predict(X_test) 
clf.score(X_test,y_test)

0.673438474561169

In [27]:
#print(pred)

In [25]:
cfm(y_test,pred)

array([[5770,   57],
       [3608, 1788]], dtype=int64)
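
The confusion matrix above can be unpacked into per-class metrics. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with class 0 (fake) first and class 1 (true) second; the variable names are chosen for this example.

```python
# Counts copied from the confusion matrix above.
# Rows: true label (0 = fake, 1 = true); columns: predicted label.
tn, fp = 5770, 57      # fake articles predicted fake / predicted true
fn, tp = 3608, 1788    # true articles predicted fake / predicted true

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_true = tp / (tp + fp)   # of articles predicted true, share that are true
recall_true = tp / (tp + fn)      # of true articles, share that were found
recall_fake = tn / (tn + fp)      # of fake articles, share that were found

print(f"total          = {total}")
print(f"accuracy       = {accuracy:.4f}")
print(f"precision(1)   = {precision_true:.4f}")
print(f"recall(1)      = {recall_true:.4f}")
print(f"recall(0)      = {recall_fake:.4f}")
```

The accuracy reproduces the score of about 0.6734 reported above, while the recall for class 1 is only about one third: the tree labels most true articles as fake.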

### PART 4

Importing the database CSV contents and analysing the news headlines dataset

In [24]:
data = pd.read_csv('News.csv')

In [25]:
data.head()

Unnamed: 0,source,title,description
0,Financial Times,The hard business lessons Covid is about to teach,"Why the hair salon, the gym and work-related t..."
1,Financial Times,Thai activist vows to escalate protests agains...,Anon Nampa criticises king for living abroad a...
2,Financial Times,Trump calls for deal on new fiscal stimulus,President tweets US ‘wants and needs’ further ...
3,Financial Times,Former Chinese government official ran TikTok’...,Cai Zheng was diplomat in China’s embassy in T...
4,Financial Times,Virus result puts focus on Donald Trump’s medi...,US president’s physicians have been accused of...


In [26]:
mapping = {'Financial Times': True,
           'Breitbart News': True,
           'News24': False,
           'News.com.au': False,
           'CBS News': True,
           'IGN': False,
           'Fox News': True,
           'ESPN': True}

In [27]:
# setting label
data['label'] = data.source.map(mapping)

In [28]:
data.head()

Unnamed: 0,source,title,description,label
0,Financial Times,The hard business lessons Covid is about to teach,"Why the hair salon, the gym and work-related t...",True
1,Financial Times,Thai activist vows to escalate protests agains...,Anon Nampa criticises king for living abroad a...,True
2,Financial Times,Trump calls for deal on new fiscal stimulus,President tweets US ‘wants and needs’ further ...,True
3,Financial Times,Former Chinese government official ran TikTok’...,Cai Zheng was diplomat in China’s embassy in T...,True
4,Financial Times,Virus result puts focus on Donald Trump’s medi...,US president’s physicians have been accused of...,True


In [29]:
#cleaning data
data['titleCleaned'] = data.title.apply(lambda x: ' '.join([word for word in x.split() if word not in stop and 
                                                          word not in string.punctuation]))

data['descriptionCleaned'] = data.description.apply(lambda x: ' '.join([word for word in x.split() if word not in stop and 
                                                          word not in string.punctuation]))

In [30]:
data['titleCleaned'] = data.titleCleaned.str.lower()
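# Note: this assigns the lowercased title to descriptionCleaned, so both columns are
# identical in the output below; data.descriptionCleaned.str.lower() was probably intended.
data['descriptionCleaned'] = data.titleCleaned.str.lower()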

In [31]:
data.drop(columns =['title','description'],inplace=True)

In [32]:
data.head()

Unnamed: 0,source,label,titleCleaned,descriptionCleaned
0,Financial Times,True,the hard business lessons covid teach,the hard business lessons covid teach
1,Financial Times,True,thai activist vows escalate protests monarchy,thai activist vows escalate protests monarchy
2,Financial Times,True,trump calls deal new fiscal stimulus,trump calls deal new fiscal stimulus
3,Financial Times,True,former chinese government official ran tiktok’...,former chinese government official ran tiktok’...
4,Financial Times,True,virus result puts focus donald trump’s medical...,virus result puts focus donald trump’s medical...


In [33]:
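# Note: x2 still contains 'source', the column the label was mapped from,
# so the features determine the label directly.
x2 = data.drop('label',axis=1)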
Y = data.label

In [34]:
X = pd.DataFrame(ordinal.fit_transform(x2),columns=x2.columns)

In [35]:
x_train,x_test,Y_train,Y_test = train_test_split(X,Y, random_state=42)

In [36]:
lg = LogisticRegression()

In [37]:
lg.fit(x_train,Y_train)
Pred = lg.predict(x_test) 
lg.score(x_test,Y_test)

1.0

In [38]:
cfm(Y_test,Pred)

array([[1, 0],
       [0, 4]], dtype=int64)