# Artificial Intelligence - Computer Assignment 3 - Bayesian Network

## Mehrdad Nourbakhsh(810194418)

In this project, we use Bayes' rule to classify news items by category, based on the short description of each item.

In [1]:
import numpy as np
import pandas as pd
import itertools
import collections
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

We have a dataset of nearly 23,000 news items, each labeled with a category. We want to build a model that predicts the category of a news item from its short description. We read the dataset from a .csv file and load it into a pandas DataFrame.

In [2]:
dataFrame = pd.read_csv('Attachment/data.csv')
dataFrame.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description
0,0,"Katherine LaGrave, ContributorTravel writer an...",TRAVEL,2014-05-07,"EccentriCities: Bingo Parties, Paella and Isla...",https://www.huffingtonpost.com/entry/eccentric...,Påskekrim is merely the tip of the proverbial ...
1,1,Ben Hallman,BUSINESS,2014-06-09,Lawyers Are Now The Driving Force Behind Mortg...,https://www.huffingtonpost.com/entry/mortgage-...,
2,2,Jessica Misener,STYLE & BEAUTY,2012-03-12,Madonna 'Truth Or Dare' Shoe Line To Debut Thi...,https://www.huffingtonpost.com/entry/madonna-s...,"Madonna is slinking her way into footwear now,..."
3,3,"Victor and Mary, Contributor\n2Sense-LA.com",TRAVEL,2013-12-17,Sophistication and Serenity on the Las Vegas S...,https://www.huffingtonpost.com/entry/las-vegas...,But what if you're a 30-something couple that ...
4,4,"Emily Cohn, Contributor",BUSINESS,2015-03-19,It's Still Pretty Hard For Women To Get Free B...,https://www.huffingtonpost.com/entry/free-birt...,Obamacare was supposed to make birth control f...


Each news item also carries other information besides the short description (e.g. headline, authors), but we ignore it and train our model only on the words of the short description.

First, we preprocess the data and use the result for the model. We normalize the short description
with the nltk library. Normalization includes these steps:
* converting uppercase to lowercase
* removing punctuation signs
* removing numbers
* converting each text into a list of words
* removing stop words
* using lemmatization to remove inflectional endings only and to return the base form of a word

If we don't convert all words to lowercase, our model might treat a word that is capitalized at the beginning of a sentence as different from the same word appearing later in the sentence without a capital letter. This could hurt the model's accuracy, so we convert all letters to lowercase.

Our model relies on word frequencies, i.e. how often each word occurs in the text. We want a word to count toward a category not only in its exact form but also in its other inflected forms, and for this purpose we use lemmatization. Lemmatization lets us treat all forms of a word as a single word, which can improve the model by increasing the word counts available for each category.
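
As a small illustration of the first five steps on a single string, here is a sketch that uses only the standard library. The function name `normalize` and the tiny `STOP_WORDS` set are assumptions made for this example; the notebook itself uses nltk's English stop-word list and `WordNetLemmatizer`, which are left out here.

```python
import re

# Hypothetical, deliberately tiny stop-word list (nltk's list is much longer).
STOP_WORDS = {"the", "is", "a", "of", "to", "in", "and", "her", "now"}

def normalize(text):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation -> space
    text = re.sub(r"\d+", "", text)        # remove numbers
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(normalize("Madonna is slinking her way into footwear now, in 2012!"))
```

Lemmatization would then map a form such as "slinking" to its base form, as the `words` column in the notebook output shows.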

In [3]:
data = dataFrame.short_description
data = data.fillna('')
data = data.str.lower()
data = data.str.replace(r'[^\w\s]', ' ', regex=True)  # punctuation -> space
data = data.str.replace(r'\d+', '', regex=True)       # remove numbers
stopWords = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
wordsList = data.apply(word_tokenize)
wordsList = wordsList.apply(lambda x: [item for item in x if item not in stopWords])
wordsList = wordsList.apply(lambda x: [lemmatizer.lemmatize(y, pos="v") for y in x])
dataFrame['words'] = wordsList
dataFrame.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description,words
0,0,"Katherine LaGrave, ContributorTravel writer an...",TRAVEL,2014-05-07,"EccentriCities: Bingo Parties, Paella and Isla...",https://www.huffingtonpost.com/entry/eccentric...,Påskekrim is merely the tip of the proverbial ...,"[påskekrim, merely, tip, proverbial, iceberg, ..."
1,1,Ben Hallman,BUSINESS,2014-06-09,Lawyers Are Now The Driving Force Behind Mortg...,https://www.huffingtonpost.com/entry/mortgage-...,,[]
2,2,Jessica Misener,STYLE & BEAUTY,2012-03-12,Madonna 'Truth Or Dare' Shoe Line To Debut Thi...,https://www.huffingtonpost.com/entry/madonna-s...,"Madonna is slinking her way into footwear now,...","[madonna, slink, way, footwear, truth, dare, p..."
3,3,"Victor and Mary, Contributor\n2Sense-LA.com",TRAVEL,2013-12-17,Sophistication and Serenity on the Las Vegas S...,https://www.huffingtonpost.com/entry/las-vegas...,But what if you're a 30-something couple that ...,"[something, couple, shy, away, table, dance, e..."
4,4,"Emily Cohn, Contributor",BUSINESS,2015-03-19,It's Still Pretty Hard For Women To Get Free B...,https://www.huffingtonpost.com/entry/free-birt...,Obamacare was supposed to make birth control f...,"[obamacare, suppose, make, birth, control, fre..."


We need to split our data into two parts: the train set and the evaluation set. We use 80 percent of the data as the train set and 20 percent as the evaluation set. We want both sets to contain all kinds of news. If we simply took 80 percent of the whole dataset, the train set might be dominated by only one or two categories, and the model would not learn to detect the others well. We therefore build a train set that draws from every category.

For each category, we choose a random 80 percent subset and combine those subsets to form our train set; the remaining 20 percent of each category forms the evaluation set.
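
The same per-category split can be sketched without pandas, with a fixed seed so that the split is reproducible. The function `stratified_split`, its parameters, and the toy rows are assumptions for illustration; the notebook performs the split with `DataFrame.sample` and no seed.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, train_frac=0.8, seed=0):
    """Split rows so every label contributes train_frac of its items to train."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, evaluate = [], []
    for items in by_label.values():
        items = items[:]              # copy before shuffling
        rng.shuffle(items)
        cut = round(len(items) * train_frac)
        train.extend(items[:cut])
        evaluate.extend(items[cut:])
    rng.shuffle(train)                # mix the categories together
    rng.shuffle(evaluate)
    return train, evaluate

rows = [("TRAVEL", i) for i in range(10)] + [("BUSINESS", i) for i in range(5)]
train, evaluate = stratified_split(rows, label_of=lambda r: r[0])
print(len(train), len(evaluate))
```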

In [4]:
travel = dataFrame[dataFrame['category'] == 'TRAVEL']
business = dataFrame[dataFrame['category'] == 'BUSINESS']
sb = dataFrame[dataFrame['category'] == 'STYLE & BEAUTY']
trainTravel = travel.sample(frac=0.8)
evaluateTravel = travel.drop(trainTravel.index)
trainBusiness = business.sample(frac=0.8)
evaluateBusiness = business.drop(trainBusiness.index)
trainSB = sb.sample(frac=0.8)
evaluateSB = sb.drop(trainSB.index)
trainData = pd.concat([trainTravel,trainBusiness, trainSB])
trainData = trainData.sample(frac=1)
evaluateData = pd.concat([evaluateSB,evaluateBusiness, evaluateTravel])
evaluateData = evaluateData.sample(frac=1)
trainData.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description,words
2827,2827,Michelle Persad,STYLE & BEAUTY,2012-10-23,Jessica Alba Pulls Off The Perfect Combination...,https://www.huffingtonpost.com/entry/jessica-a...,Want more? Be sure to check out Stylelist on T...,"[want, sure, check, stylelist, twitter, facebo..."
16662,16662,"Dr. Steve Rosenberg, Contributor\nPodiatrist, ...",STYLE & BEAUTY,2013-11-25,Solutions to Prevent Tired Feet and Legs Durin...,https://www.huffingtonpost.com/entry/solutions...,It is holiday time again along with the long l...,"[holiday, time, along, long, line, wait, buy, ..."
4318,4318,Mark Gongloff,BUSINESS,2014-08-01,Why A Higher Unemployment Rate Is Actually Goo...,https://www.huffingtonpost.com/entry/unemploym...,,[]
20213,20213,,TRAVEL,2012-05-31,Virgin Galactic Gets FAA Approval To Test Spac...,https://www.huffingtonpost.com/entry/virgin-ga...,"By: Brian Berger, Space News Published: 05/31/...","[brian, berger, space, news, publish, edt, spa..."
20577,20577,"24/7 Wall St., 24/7 Wall St.",BUSINESS,2014-01-25,States With The Least Government Benefits: 24/...,https://www.huffingtonpost.com/entry/states-go...,"Right now, the states already bear a substanti...","[right, state, already, bear, substantial, bur..."


After creating our train set, we train our model on it.
For this purpose, we use Bayes' rule.

$$ P(c|x) = \frac{P(x|c)\times P(c)}{P(x)} $$

As we can see, Bayes' rule has four parts: posterior, likelihood, prior, and evidence. To use this rule, we must define each part for our project.

The posterior probability is the probability of a category given the words in the news item. We use Bayes' rule to calculate this probability.
$$ posterior = {P(category | x_0, x_1, x_2, ...,  x_n)} $$
where $x_n$ is the n-th word of the news item.

We define the prior as the probability of each category, i.e. how probable it is for a news item to belong to a certain category.

We define the likelihood as the probability of the words of a news item given the category, i.e. how probable it is for that category to use those words. In other words, the likelihood is:
$$ likelihood = {P(x_0, x_1, x_2, ...,  x_n|category)} $$
We assume that, within a category, the occurrence of one word is independent of the occurrence of any other word in the same news item (the naive Bayes assumption), so we can multiply the individual probabilities to obtain the conditional probability.
$$ likelihood = {P(x_0|category)\times P(x_1|category) \times P(x_2|category) \times ... \times P(x_n|category)} $$

The evidence is the probability of all the words that appear in a given news item.
$$ evidence = {P(x_0, x_1, x_2, ...,  x_n)} $$
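
Because the evidence does not depend on the category, it is the same for every category of a given news item and can be dropped when only the most probable category is needed. Under the independence assumption above, the decision rule can be written as follows (the symbol $\hat{c}$ for the predicted category is notation introduced here):

$$ \hat{c} = \arg\max_{category} P(category) \times \prod_{i=0}^{n} P(x_i|category) $$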



##  Part $\mathrm{I}$

In [5]:
travelWords = trainTravel.words.values
travelWords = list(itertools.chain.from_iterable(travelWords))
BusinessWords = trainBusiness.words.values
BusinessWords = list(itertools.chain.from_iterable(BusinessWords))
allTravelAndBusinessWord = list(map(''.join, set(itertools.chain(travelWords, BusinessWords))))
print(len(travelWords),len(BusinessWords),len(allTravelAndBusinessWord))
# newDataFrame = pd.DataFrame(columns=['Word','Travel occurrences','Business occurrences'])
# for word in allTravelAndBusinessWord:
#     newDataFrame = newDataFrame.append({'Word' : word,'Travel occurrences' : travelWords.count(word),'Business occurrences' : BusinessWords.count(word)},ignore_index=True)


103142 51994 17700


In [6]:
# newDataFrame['Travel Probability'] = newDataFrame['Travel occurrences'] / newDataFrame['Travel occurrences'].sum()
# newDataFrame['Business Probability'] = newDataFrame['Business occurrences'] / newDataFrame['Business occurrences'].sum()
# newDataFrame = newDataFrame.set_index('Word')
# newDataFrame

In [7]:
# travelAndBusinessEvaluateData = pd.concat([evaluateBusiness, evaluateTravel])
# travelAndBusinessEvaluateData = travelAndBusinessEvaluateData.sample(frac=1)


In [8]:
# for index,row in travelAndBusinessEvaluateData.iterrows():
#     travelPriorProbability = len(trainTravel)/(len(trainTravel)+len(trainBusiness))
#     businessPriorProbability = len(trainBusiness)/(len(trainTravel)+len(trainBusiness))
#     words = set(row['words'])
#     for word in words:
#         if word in allTravelAndBusinessWord:
#             travelPriorProbability *= newDataFrame.at[word,'Travel Probability']
#             businessPriorProbability *= newDataFrame.at[word,'Business Probability']

#     travelAndBusinessEvaluateData.at[index,'Travel Probability'] = travelPriorProbability
#     travelAndBusinessEvaluateData.at[index,'Business Probability'] = businessPriorProbability
#     if travelPriorProbability >= businessPriorProbability:
#         travelAndBusinessEvaluateData.at[index,'Prediction'] = 'TRAVEL'
#     else:
#         travelAndBusinessEvaluateData.at[index,'Prediction'] = 'BUSINESS'
# travelAndBusinessEvaluateData

In [9]:
# travelAndBusinessEvaluateData['correct'] = (travelAndBusinessEvaluateData['category'] == travelAndBusinessEvaluateData['Prediction'])
# correctDetected = (travelAndBusinessEvaluateData['correct']).sum()
# accuracy =  correctDetected / len(travelAndBusinessEvaluateData)
# accuracy

In [10]:
sbWords = trainSB.words.values
sbWords = list(itertools.chain.from_iterable(sbWords))
allTrainDataWords = list(map(''.join, set(itertools.chain(travelWords, BusinessWords,sbWords))))
print(len(travelWords),len(BusinessWords),len(sbWords),len(allTrainDataWords))
sbWordsCount = dict(collections.Counter(sbWords))
travelWordsCount = dict(collections.Counter(travelWords))
BusinessWordsCount = dict(collections.Counter(BusinessWords))
print(len(travelWordsCount),len(BusinessWordsCount),len(sbWordsCount),len(allTrainDataWords))
# Collect one row per vocabulary word, then build the DataFrame in one step
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop).
rows = []
for word in allTrainDataWords:
    rows.append({'Word': word,
                 'Travel occurrences': travelWordsCount.get(word, 0),
                 'Business occurrences': BusinessWordsCount.get(word, 0),
                 'Style & Beauty occurrences': sbWordsCount.get(word, 0)})
allDataFrame = pd.DataFrame(rows, columns=['Word','Travel occurrences','Business occurrences','Style & Beauty occurrences'])


103142 51994 86049 22233
14342 8703 11374 22233


In [11]:
alpha = 0.5
vocabularySize = len(allTrainDataWords)  # number of distinct words in the train set
# alpha is added to the numerator only in the Travel column; the Business and Style & Beauty
# columns divide the raw counts, so a word unseen in those categories gets probability 0.
allDataFrame['Travel Probability'] = (allDataFrame['Travel occurrences'] + alpha) / (allDataFrame['Travel occurrences'].sum() + vocabularySize*alpha)
allDataFrame['Business Probability'] = allDataFrame['Business occurrences'] / (allDataFrame['Business occurrences'].sum() + vocabularySize*alpha)
allDataFrame['Style & Beauty Probability'] = allDataFrame['Style & Beauty occurrences'] / (allDataFrame['Style & Beauty occurrences'].sum() + vocabularySize*alpha)
allDataFrame = allDataFrame.set_index('Word')
allDataFrame

Unnamed: 0_level_0,Travel occurrences,Business occurrences,Style & Beauty occurrences,Travel Probability,Business Probability,Style & Beauty Probability
Word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
midways,1,0,0,1.31281e-05,0,0
injuries,0,1,1,4.37604e-06,1.58452e-05,1.02917e-05
comme,0,0,1,4.37604e-06,0,1.02917e-05
ct,1,0,0,1.31281e-05,0,0
binders,0,0,1,4.37604e-06,0,1.02917e-05
...,...,...,...,...,...,...
iphones,1,0,2,1.31281e-05,0,2.05834e-05
whas,0,1,0,4.37604e-06,1.58452e-05,0
dprk,1,0,0,1.31281e-05,0,0
see,333,105,491,0.00291882,0.00166375,0.00505323
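
The table shows a probability of exactly 0 in the Business and Style & Beauty columns for words that never occur in those categories, and a single such word zeroes out that category's whole product. One common remedy is additive (Laplace) smoothing applied in the same way to every category. The following is a minimal sketch with toy counts; the function name `smoothed_probabilities` and the sample words are assumptions for illustration, and the numbers are unrelated to the notebook's results.

```python
from collections import Counter

def smoothed_probabilities(counts, vocabulary, alpha=0.5):
    """P(word | category) = (count + alpha) / (total + alpha * |V|)."""
    total = sum(counts[w] for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

# Toy word lists standing in for the per-category training words.
travel = Counter(["hotel", "beach", "hotel"])
business = Counter(["market", "bank"])
vocabulary = set(travel) | set(business)

p_travel = smoothed_probabilities(travel, vocabulary)
p_business = smoothed_probabilities(business, vocabulary)
print(p_business["hotel"])   # non-zero even though "hotel" is unseen in business
```

With the same `alpha` in numerator and denominator, the probabilities of each category still sum to 1 over the vocabulary.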


In [12]:
# Size of the train set, used for the class priors.
trainSize = len(trainTravel) + len(trainBusiness) + len(trainSB)
vocabulary = set(allTrainDataWords)  # a set gives fast membership tests
for index,row in evaluateData.iterrows():
    # Each score starts at the class prior and is multiplied by P(word | category).
    travelPriorProbability = len(trainTravel)/trainSize
    businessPriorProbability = len(trainBusiness)/trainSize
    sbPriorProbability = len(trainSB)/trainSize
    words = set(row['words'])
    for word in words:
        if word in vocabulary:
            travelPriorProbability *= allDataFrame.at[word,'Travel Probability']
            businessPriorProbability *= allDataFrame.at[word,'Business Probability']
            sbPriorProbability *= allDataFrame.at[word,'Style & Beauty Probability']
    evaluateData.at[index,'Travel Probability'] = travelPriorProbability
    evaluateData.at[index,'Business Probability'] = businessPriorProbability
    evaluateData.at[index,'Style & Beauty Probability'] = sbPriorProbability
    # Pick the category with the largest score (on ties, the last matching branch wins).
    if travelPriorProbability >= businessPriorProbability and travelPriorProbability >= sbPriorProbability:
        evaluateData.at[index,'Prediction'] = 'TRAVEL'
    if businessPriorProbability >= travelPriorProbability and businessPriorProbability >= sbPriorProbability:
        evaluateData.at[index,'Prediction'] = 'BUSINESS'
    if sbPriorProbability >= travelPriorProbability and sbPriorProbability>= businessPriorProbability:
        evaluateData.at[index,'Prediction'] = 'STYLE & BEAUTY'


In [13]:
evaluateData.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description,words,Travel Probability,Business Probability,Style & Beauty Probability,Prediction
503,503,"Mary Kincaid, Contributor\nFounder and Editor ...",STYLE & BEAUTY,2013-06-25,Weekly Roundup of eBay Vintage Clothing Finds,https://www.huffingtonpost.com/entry/weekly-ro...,"This week's selections include pieces by YSL, ...","[week, selections, include, piece, ysl, moschi...",6.168558e-53,0.0,4.843276e-45,STYLE & BEAUTY
17924,17924,"Oyster, ContributorVisiting, photographing, re...",TRAVEL,2015-01-16,The Sexiest Hotels in the Dominican Republic,https://www.huffingtonpost.com/entry/the-sexie...,Although language can be a barrier for non-Spa...,"[although, language, barrier, non, spanish, sp...",3.239074e-49,4.812035e-53,0.0,TRAVEL
8893,8893,"Susan Fogwell, Contributor\nLifestyle & Travel...",TRAVEL,2012-02-06,How To Plan A Caribbean Sailing Trip,https://www.huffingtonpost.com/entry/planning-...,"For experienced sailors, it is thrilling to sa...","[experience, sailors, thrill, sail, among, fou...",4.976807e-98,0.0,0.0,TRAVEL
1657,1657,Suzy Strutner,TRAVEL,2015-04-23,Turns Out Airplane 'Oxygen Masks' Aren't Exact...,https://www.huffingtonpost.com/entry/turns-out...,Frequent travelers know that in the event of c...,"[frequent, travelers, know, event, change, cab...",1.759336e-45,0.0,0.0,TRAVEL
11392,11392,"24/7 Wall St., 24/7 Wall St.",BUSINESS,2013-04-13,America's Fattest Cities: 24/7 Wall St.,https://www.huffingtonpost.com/entry/americas-...,Click here to see America’s fattest cities 24/...,"[click, see, america, fattest, cities, wall, s...",4.876743e-41,0.0,0.0,TRAVEL
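
The products above are already as small as 1e-98, and longer descriptions can underflow to 0.0 in floating point. A common workaround is to add logarithms instead of multiplying probabilities; the ranking of the categories is unchanged because the logarithm is monotonic. A minimal sketch, assuming smoothed (non-zero) word probabilities; `log_score`, `prior`, `word_probs`, and the toy values are hypothetical and not part of the notebook:

```python
import math

def log_score(words, prior, word_probs):
    """log P(c) + sum of log P(w | c) over the known words of one news item."""
    score = math.log(prior)
    for w in set(words):
        if w in word_probs:           # skip words never seen in training
            score += math.log(word_probs[w])
    return score

# Toy per-category word probabilities (all non-zero).
travel_probs = {"hotel": 0.5, "beach": 0.3, "market": 0.1, "bank": 0.1}
business_probs = {"hotel": 0.125, "beach": 0.125, "market": 0.375, "bank": 0.375}

words = ["beach", "hotel", "hotel"]
scores = {
    "TRAVEL": log_score(words, 0.6, travel_probs),
    "BUSINESS": log_score(words, 0.4, business_probs),
}
print(max(scores, key=scores.get))
```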


In [24]:
evaluateData['Correct'] = (evaluateData['category'] == evaluateData['Prediction'])
correctDetected = (evaluateData['Correct']).sum()
accuracy =  correctDetected / len(evaluateData)
# Number of evaluation items that truly belong to each category
AllOfTravel = (evaluateData['category'] == 'TRAVEL').sum()
AllOfBusiness = (evaluateData['category'] == 'BUSINESS').sum()
AllOfSB = (evaluateData['category'] == 'STYLE & BEAUTY').sum()
# Correct predictions per category. .all(axis='columns') keeps only rows whose columns are
# all truthy, so rows containing a falsy value (for example a 0.0 probability) are not counted.
correctTravel = (evaluateData.loc[(evaluateData['Correct'] == True) & (evaluateData['category'] == 'TRAVEL')].all(axis='columns')).sum()
correctBusiness = (evaluateData.loc[(evaluateData['Correct'] == True) & (evaluateData['category'] == 'BUSINESS')].all(axis='columns')).sum()
correctSB = (evaluateData.loc[(evaluateData['Correct'] == True) & (evaluateData['category'] == 'STYLE & BEAUTY')].all(axis='columns')).sum()
# Number of evaluation items predicted as each category
AllTravelDetected = (evaluateData['Prediction'] == 'TRAVEL').sum()
AllSBDetected = (evaluateData['Prediction'] == 'STYLE & BEAUTY').sum()
AllBusinessDetected = (evaluateData['Prediction'] == 'BUSINESS').sum()
# Per-class recall = correct / actual; the reported figure is the sum of the three
recall1 = (correctTravel) / (AllOfTravel)
recall2 = correctBusiness / AllOfBusiness
recall3 = correctSB / AllOfSB
recall = recall1 + recall2 + recall3
# Per-class precision = correct / predicted; again summed over the three classes
precision1 = (correctTravel) / (AllTravelDetected)
precision2 = correctBusiness / AllBusinessDetected
precision3 = correctSB / AllSBDetected
precision = precision1 + precision2 + precision3
print(accuracy)
print(recall)
print(precision)

0.7431312690798081
0.34605442104875317
583
0.4712277485992446
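
Per-class recall is the fraction of news items of a category that were predicted as that category, and per-class precision is the fraction of predictions of a category that were right. Each lies between 0 and 1, and a single summary figure is usually obtained by averaging over the classes (macro-averaging) rather than summing. The sketch below computes these quantities from plain lists of labels; `per_class_metrics` and the toy labels are assumptions for illustration and do not reproduce the numbers printed above.

```python
def per_class_metrics(actual, predicted):
    """Return {label: (precision, recall)} for every label in `actual`."""
    metrics = {}
    for label in sorted(set(actual)):
        # True positives: items of this label that were predicted as this label.
        tp = sum(a == label and p == label for a, p in zip(actual, predicted))
        n_actual = sum(a == label for a in actual)
        n_predicted = sum(p == label for p in predicted)
        precision = tp / n_predicted if n_predicted else 0.0
        recall = tp / n_actual if n_actual else 0.0
        metrics[label] = (precision, recall)
    return metrics

actual    = ["TRAVEL", "TRAVEL", "BUSINESS", "BUSINESS", "STYLE & BEAUTY"]
predicted = ["TRAVEL", "BUSINESS", "BUSINESS", "TRAVEL", "STYLE & BEAUTY"]

metrics = per_class_metrics(actual, predicted)
macro_precision = sum(p for p, _ in metrics.values()) / len(metrics)
macro_recall = sum(r for _, r in metrics.values()) / len(metrics)
print(metrics)
print(macro_precision, macro_recall)
```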
