# Artificial Intelligence - Computer Assignment 3 - Bayesian Network

## Mehrdad Nourbakhsh(810194418)

In this project, we use Bayes' rule to classify news items by category, based on a short description of each item.

In [1]:
import numpy as np
import pandas as pd
import itertools
import collections
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

We have a dataset of nearly 23,000 news items, each labeled with a category. We want to build a model that predicts the category of a news item from its short description. We read the dataset from a .csv file and load it into a pandas DataFrame.

In [2]:
dataFrame = pd.read_csv('Attachment/data.csv')
dataFrame.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description
0,0,"Katherine LaGrave, ContributorTravel writer an...",TRAVEL,2014-05-07,"EccentriCities: Bingo Parties, Paella and Isla...",https://www.huffingtonpost.com/entry/eccentric...,Påskekrim is merely the tip of the proverbial ...
1,1,Ben Hallman,BUSINESS,2014-06-09,Lawyers Are Now The Driving Force Behind Mortg...,https://www.huffingtonpost.com/entry/mortgage-...,
2,2,Jessica Misener,STYLE & BEAUTY,2012-03-12,Madonna 'Truth Or Dare' Shoe Line To Debut Thi...,https://www.huffingtonpost.com/entry/madonna-s...,"Madonna is slinking her way into footwear now,..."
3,3,"Victor and Mary, Contributor\n2Sense-LA.com",TRAVEL,2013-12-17,Sophistication and Serenity on the Las Vegas S...,https://www.huffingtonpost.com/entry/las-vegas...,But what if you're a 30-something couple that ...
4,4,"Emily Cohn, Contributor",BUSINESS,2015-03-19,It's Still Pretty Hard For Women To Get Free B...,https://www.huffingtonpost.com/entry/free-birt...,Obamacare was supposed to make birth control f...


Each news item carries more information than the short description (e.g. headline, authors), but we ignore it and train our model only on the words of the short description.

First, we preprocess the data and use the result for the model. We normalize the short description with the nltk library. Normalization includes these steps:
* converting uppercase to lowercase
* removing punctuation signs
* removing numbers
* converting each text into a list of words
* removing stop words
* using lemmatization to remove inflectional endings only and to return the base form of a word

If we don't convert all words to lowercase, our model may treat a word that is capitalized at the beginning of a sentence as different from the same word appearing later in the sentence without a capital letter. This can hurt the accuracy of our model, so we convert all letters to lowercase.

Our model relies on word frequencies, i.e. how often each word occurs in the text. We want to match not only the exact form of a word but also its other inflected forms, so we use lemmatization. Lemmatization lets us treat all forms of a word as a single word, which can improve the model by increasing the word counts for each category.
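The normalization steps listed above can be sketched on a single sentence with the standard library alone. This is only an illustration: the tiny `STOP_WORDS` set and the `normalize` helper are hypothetical stand-ins for nltk's English stop-word list and tokenizer, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses nltk's English list instead.
STOP_WORDS = {"is", "the", "a", "of", "in", "to", "and"}

def normalize(text):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation -> space
    text = re.sub(r"\d+", "", text)        # remove numbers
    tokens = text.split()                  # simplified whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

# Made-up sentence, not taken from the dataset.
print(normalize("The 3 Hotels in Rome, Reviewed!"))
```

The order matters slightly: lowercasing comes first so that the stop-word lookup, which uses lowercase entries, matches capitalized words such as "The".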

In [21]:
def cleanText(df):
    
    data = df.short_description
    data = data.fillna('')
    data = data.str.lower()
    data = data.str.replace(r'[^\w\s]', ' ', regex=True)  # replace punctuation with spaces
    data = data.str.replace(r'\d+', '', regex=True)  # remove numbers
    stopWords = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    wordsList = data.apply(word_tokenize)
    wordsList = wordsList.apply(lambda x: [item for item in x if item not in stopWords])
    wordsList = wordsList.apply(lambda x: [lemmatizer.lemmatize(y, pos="v") for y in x])
    df['words'] = wordsList
    return df

dataFrame = cleanText(dataFrame)
dataFrame.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description,words
0,0,"Katherine LaGrave, ContributorTravel writer an...",TRAVEL,2014-05-07,"EccentriCities: Bingo Parties, Paella and Isla...",https://www.huffingtonpost.com/entry/eccentric...,Påskekrim is merely the tip of the proverbial ...,"[påskekrim, merely, tip, proverbial, iceberg, ..."
1,1,Ben Hallman,BUSINESS,2014-06-09,Lawyers Are Now The Driving Force Behind Mortg...,https://www.huffingtonpost.com/entry/mortgage-...,,[]
2,2,Jessica Misener,STYLE & BEAUTY,2012-03-12,Madonna 'Truth Or Dare' Shoe Line To Debut Thi...,https://www.huffingtonpost.com/entry/madonna-s...,"Madonna is slinking her way into footwear now,...","[madonna, slink, way, footwear, truth, dare, p..."
3,3,"Victor and Mary, Contributor\n2Sense-LA.com",TRAVEL,2013-12-17,Sophistication and Serenity on the Las Vegas S...,https://www.huffingtonpost.com/entry/las-vegas...,But what if you're a 30-something couple that ...,"[something, couple, shy, away, table, dance, e..."
4,4,"Emily Cohn, Contributor",BUSINESS,2015-03-19,It's Still Pretty Hard For Women To Get Free B...,https://www.huffingtonpost.com/entry/free-birt...,Obamacare was supposed to make birth control f...,"[obamacare, suppose, make, birth, control, fre..."


We need to separate our data into two parts: the train set and the evaluate set. We use 80 percent of the data as the train set and 20 percent as the evaluate set. We want the train set to contain all sorts of news. If we sampled 80 percent of the whole dataset without regard to category, some categories could be under-represented in the train set, and our model would not detect them well. We therefore build a train set with good coverage of every category.

For each category, we choose a random 80 percent subset of that category and combine those subsets to create our train set; the remaining 20 percent of each category forms the evaluate set.
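The per-category split described above can be sketched without pandas as follows. The helper `stratified_split`, its parameters, and the toy `(category, id)` pairs are assumptions made for illustration; the notebook itself uses `DataFrame.sample` and `DataFrame.drop`.

```python
import random
from collections import defaultdict

def stratified_split(items, label_of, train_frac=0.8, seed=0):
    """Split items so each label contributes train_frac of its items to the train set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)
    train, evaluate = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_frac)
        train.extend(shuffled[:cut])      # 80 percent of this category
        evaluate.extend(shuffled[cut:])   # remaining 20 percent
    return train, evaluate

# Toy data with unequal class sizes.
news = [("TRAVEL", i) for i in range(10)] + [("BUSINESS", i) for i in range(5)]
train, evaluate = stratified_split(news, label_of=lambda item: item[0])
print(len(train), len(evaluate))
```

Because the cut is made inside each category, both categories appear in both sets regardless of how the shuffle falls.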

In [146]:
travel = dataFrame[dataFrame['category'] == 'TRAVEL']
business = dataFrame[dataFrame['category'] == 'BUSINESS']
sb = dataFrame[dataFrame['category'] == 'STYLE & BEAUTY']
trainTravel = travel.sample(frac=0.8)
evaluateTravel = travel.drop(trainTravel.index)
trainBusiness = business.sample(frac=0.8)
evaluateBusiness = business.drop(trainBusiness.index)
trainSB = sb.sample(frac=0.8)
evaluateSB = sb.drop(trainSB.index)
evaluateData = pd.concat([evaluateSB,evaluateBusiness, evaluateTravel])
evaluateData = evaluateData.sample(frac=1)

In [147]:
# Oversample BUSINESS: draw 70% of its train rows again (with replacement) and append them
q = trainBusiness.sample(frac=0.7,replace=True)
trainBusiness = pd.concat([trainBusiness,q],ignore_index=True)

After creating our train set, we train our model on it using Bayes' rule.

$$ P(c|x) = \frac{P(x|c)\times P(c)}{P(x)} $$

As we can see, Bayes' rule has four parts: posterior, likelihood, prior and evidence. To use this rule, we define each part for our project.

The posterior is the probability of a category given the words in a news item. We use Bayes' rule to calculate this probability.

$$ posterior = {P(category | x_0, x_1, x_2, ...,  x_n)} $$

where $x_n$ is the n-th word of the news item.

The prior is the probability of each category, i.e. how probable it is for a news item to belong to that category.

The likelihood is the probability of the words of a news item given the category, i.e. how probable it is for that category to use those words. In other words, the likelihood is:

$$ likelihood = {P(x_0, x_1, x_2, ...,  x_n|category)} $$

We assume that, given the category, the occurrence of one word in a news item is independent of the occurrence of any other word (the naive Bayes assumption). Under this assumption we can multiply the per-word probabilities to calculate the likelihood.

$$ likelihood = {P(x_0|category)\times P(x_1|category) \times P(x_2|category) \times ... \times P(x_n|category)} $$

The evidence is the probability of all the words that appear in a given news item.

$$ evidence = {P(x_0, x_1, x_2, ...,  x_n)} $$

Since we only compare the posterior probabilities across categories, and the evidence is the same for every category, we don't need to calculate the evidence.
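A small numeric sketch may make this concrete. All numbers below are invented for illustration and do not come from the dataset, and the evidence is computed by summing over the two categories, which assumes they are the only possible ones.

```python
from math import prod

# Toy priors and per-word conditional probabilities (hypothetical values).
priors = {"TRAVEL": 0.6, "BUSINESS": 0.4}
word_probs = {
    "TRAVEL": {"hotel": 0.05, "market": 0.01},
    "BUSINESS": {"hotel": 0.01, "market": 0.04},
}
words = ["hotel", "market"]

# Unnormalized score: prior times the product of per-word likelihoods.
scores = {c: priors[c] * prod(word_probs[c][w] for w in words) for c in priors}

# Dividing by the evidence rescales every score by the same constant.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

print(max(scores, key=scores.get), max(posteriors, key=posteriors.get))
```

Both `max` calls select the same category, which is why the classifier can skip the evidence term.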



##  Part $\mathrm{I}$

In this part, we train our model on only two categories: TRAVEL and BUSINESS.

In [148]:
travelWords = trainTravel.words.values
travelWords = list(itertools.chain.from_iterable(travelWords))
BusinessWords = trainBusiness.words.values
BusinessWords = list(itertools.chain.from_iterable(BusinessWords))
allTravelAndBusinessWord = list(map(''.join, set(itertools.chain(travelWords, BusinessWords))))
travelWordsCount = dict(collections.Counter(travelWords))
BusinessWordsCount = dict(collections.Counter(BusinessWords))
rows = []
for word in allTravelAndBusinessWord:
    # a word missing from a category's counter has zero occurrences there
    to = travelWordsCount.get(word, 0)
    bo = BusinessWordsCount.get(word, 0)
    rows.append({'Word' : word,'Travel occurrences' : to,'Business occurrences' : bo})
# build the frame once; DataFrame.append was removed in pandas 2.0
newDataFrame = pd.DataFrame(rows, columns=['Word','Travel occurrences','Business occurrences'])


We create a new data frame that stores, for each word, its number of occurrences in each category. From these counts we calculate $ P(word | category) $ for each word in our training set (in this part the training set contains only the two categories mentioned).


### Laplace

To calculate the conditional probability we use Laplace smoothing. Since we multiply the conditional probabilities to calculate the likelihood, a word that never occurs in a category in the train set would make the likelihood for that category zero, even if all the other conditional probabilities are high. Laplace smoothing solves this zero-probability problem.

$$ P(word|category) =  \frac{O(word,category) + \alpha}{S(words,category) + |A|\alpha} $$

We use this formula to calculate the conditional probability. O(word, category) is the number of occurrences of the word in that category, S(words, category) is the total number of words in that category, and |A| is the number of distinct words in our train set.
$\alpha$ is a positive constant that keeps every probability above zero.
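The formula can be checked on a toy vocabulary. The word lists below are invented, and `smoothed_probability` is a hypothetical helper name; only the formula itself and the value `alpha = 0.5` come from this notebook.

```python
from collections import Counter

def smoothed_probability(word, category_counts, vocabulary_size, alpha=0.5):
    """P(word | category) with Laplace smoothing."""
    total = sum(category_counts.values())          # S(words, category)
    occurrences = category_counts.get(word, 0)     # O(word, category)
    return (occurrences + alpha) / (total + vocabulary_size * alpha)

# Toy word lists per category.
travel_counts = Counter(["beach", "hotel", "hotel", "flight"])
business_counts = Counter(["market", "stock", "hotel"])
vocabulary = set(travel_counts) | set(business_counts)   # |A| distinct words

# "market" never occurs in TRAVEL, yet its probability stays above zero.
p_market_travel = smoothed_probability("market", travel_counts, len(vocabulary))
print(p_market_travel)

# The smoothed probabilities over the whole vocabulary still sum to 1.
total = sum(smoothed_probability(w, travel_counts, len(vocabulary)) for w in vocabulary)
print(total)
```

Adding `|A| * alpha` to the denominator is what keeps the distribution normalized after `alpha` has been added to every count.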

In [149]:
alpha = 0.5
newDataFrame['Travel Probability'] = (newDataFrame['Travel occurrences'] + alpha) / (newDataFrame['Travel occurrences'].sum() + (len(set(travelWords + BusinessWords))*alpha))
newDataFrame['Business Probability'] = (newDataFrame['Business occurrences'] + alpha) / (newDataFrame['Business occurrences'].sum() + (len(set(travelWords + BusinessWords))*alpha))
newDataFrame = newDataFrame.set_index('Word')
newDataFrame

Unnamed: 0_level_0,Travel occurrences,Business occurrences,Travel Probability,Business Probability
Word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
família,2,0,2.222e-05,4.64803e-06
join,27,40,0.000244419,0.00037649
bellavista,1,0,1.3332e-05,4.64803e-06
lighter,2,0,2.222e-05,4.64803e-06
luminous,2,0,2.222e-05,4.64803e-06
...,...,...,...,...
leary,3,1,3.11079e-05,1.39441e-05
fairness,0,4,4.44399e-06,4.18323e-05
whoever,3,0,3.11079e-05,4.64803e-06
irkutsk,1,0,1.3332e-05,4.64803e-06


In [150]:
travelAndBusinessEvaluateData = pd.concat([evaluateBusiness, evaluateTravel])
travelAndBusinessEvaluateData = travelAndBusinessEvaluateData.sample(frac=1)

After calculating the conditional probabilities, our model is ready to be tested on the evaluate data. For each news item in the evaluate set, we start from the prior probability of each category and multiply it by the conditional probability of every distinct word of the item that also appears in the train set; words never seen in training are skipped. The result is proportional to the posterior probability. The category with the highest value is our model's prediction.
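Products of many small probabilities become extremely small and, for long texts, can underflow to zero in floating point. A common alternative, which is not what this notebook does, is to add logarithms instead of multiplying. The sketch below assumes hypothetical `priors` and `word_probs` dictionaries with invented values.

```python
import math

def predict(words, priors, word_probs):
    """Return (best_category, scores) using sums of logs instead of products."""
    scores = {}
    for category, prior in priors.items():
        score = math.log(prior)
        for word in set(words):                # each distinct word counts once
            if word in word_probs[category]:   # words unseen in training are skipped
                score += math.log(word_probs[category][word])
        scores[category] = score
    return max(scores, key=scores.get), scores

# Toy numbers, not taken from the dataset.
priors = {"TRAVEL": 0.6, "BUSINESS": 0.4}
word_probs = {
    "TRAVEL": {"hotel": 0.05, "market": 0.01},
    "BUSINESS": {"hotel": 0.01, "market": 0.04},
}
label, scores = predict(["hotel", "hotel", "unknown"], priors, word_probs)
print(label)

# A long product of small factors underflows, while the log sum stays finite.
product, log_sum = 1.0, 0.0
for _ in range(200):
    product *= 0.001
    log_sum += math.log(0.001)
print(product, log_sum)
```

Because the logarithm is monotonic, the category with the largest log score is also the one with the largest product, so the prediction is unchanged.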

In [151]:
for index,row in travelAndBusinessEvaluateData.iterrows():
    travelPriorProbability = len(trainTravel)/(len(trainTravel)+len(trainBusiness))
    businessPriorProbability = len(trainBusiness)/(len(trainTravel)+len(trainBusiness))
    words = set(row['words'])
    for word in words:
        if word in allTravelAndBusinessWord:
            travelPriorProbability *= newDataFrame.at[word,'Travel Probability']
            businessPriorProbability *= newDataFrame.at[word,'Business Probability']

    travelAndBusinessEvaluateData.at[index,'Travel Probability'] = travelPriorProbability
    travelAndBusinessEvaluateData.at[index,'Business Probability'] = businessPriorProbability
    if travelPriorProbability >= businessPriorProbability:
        travelAndBusinessEvaluateData.at[index,'Prediction'] = 'TRAVEL'
    else:
        travelAndBusinessEvaluateData.at[index,'Prediction'] = 'BUSINESS'
travelAndBusinessEvaluateData.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description,words,Travel Probability,Business Probability,Prediction
15799,15799,"Wendy Smith, ContributorPioneer and passionate...",BUSINESS,2016-10-15,Corporate's Responsibility Toward Social Susta...,https://www.huffingtonpost.com/entry/corporate...,Today we live in a world where technology cont...,"[today, live, world, technology, continue, sim...",9.665179000000001e-39,2.681452e-37,BUSINESS
10951,10951,"simon confino, ContributorFounder Director We-...",BUSINESS,2016-09-05,Selfish Donald Trump versus Selfless Mother Te...,https://www.huffingtonpost.com/entry/selfish-d...,Mother Teresa has this week become a saint. A ...,"[mother, teresa, week, become, saint, saint, s...",1.441085e-96,9.753046e-91,BUSINESS
21037,21037,"Erica Firpo, Contributor\nTravel Journalist ba...",TRAVEL,2013-03-11,5 Museums In Rome To Visit While The Sistine C...,https://www.huffingtonpost.com/entry/5-museums...,Even during the frenzy of a conclave Rome has ...,"[even, frenzy, conclave, rome, art, clergy]",6.904126e-15,2.1190220000000002e-17,TRAVEL
9394,9394,"Anne Z. Cooke, Contributor\nTravel & Feature J...",TRAVEL,2013-07-17,Untamed in Ucluelet,https://www.huffingtonpost.com/entry/untamed-i...,"At first glance, Ucluelet looks like any other...","[first, glance, ucluelet, look, like, end, roa...",3.286041e-74,9.935505999999999e-87,TRAVEL
2965,2965,"Susan Portnoy, ContributorThe Insatiable Traveler",TRAVEL,2015-03-05,11 Great Pre-Trip Prep Tips to Start Your Trav...,https://www.huffingtonpost.com/entry/travel-pr...,"Alexander Graham Bell once said, ""Before anyth...","[alexander, graham, bell, say, anything, else,...",7.203075000000001e-75,1.098199e-74,BUSINESS


In [168]:
#Travel
correctTravel = travelAndBusinessEvaluateData.loc[(travelAndBusinessEvaluateData['category'] == 'TRAVEL') & (travelAndBusinessEvaluateData['Prediction'] == 'TRAVEL')]
correctTravel = len(correctTravel)
allOfTravel = (travelAndBusinessEvaluateData['category'] == 'TRAVEL').sum()
allTravelPrediction = (travelAndBusinessEvaluateData['Prediction'] == 'TRAVEL').sum()
travelRecall = correctTravel/allOfTravel
travelPrecision = correctTravel/allTravelPrediction
print(travelRecall,travelPrecision)
#Business
correctBusiness = travelAndBusinessEvaluateData.loc[(travelAndBusinessEvaluateData['category'] == 'BUSINESS') & (travelAndBusinessEvaluateData['Prediction'] == 'BUSINESS')]
correctBusiness = len(correctBusiness)
allOfBusiness = (travelAndBusinessEvaluateData['category'] == 'BUSINESS').sum()
allBusinessPrediction = (travelAndBusinessEvaluateData['Prediction'] == 'BUSINESS').sum()
BusinessRecall = correctBusiness/allOfBusiness
BusinessPrecision = correctBusiness/allBusinessPrediction
print(BusinessRecall,BusinessPrecision)
travelAndBusinessEvaluateData['correct'] = (travelAndBusinessEvaluateData['category'] == travelAndBusinessEvaluateData['Prediction'])
correctDetected = (travelAndBusinessEvaluateData['correct']).sum()
accuracy =  correctDetected / len(travelAndBusinessEvaluateData)
print(accuracy)

0.8657303370786517 0.9524103831891224
0.9279700654817586 0.8058489033306255
0.8890838890838891


| Phase 1 | Travel | Business |
| --- | --- | --- |
| Recall | 87% | 93% |
| Precision | 95% | 81% |
| Accuracy | 89% | 89% |

##  Part $\mathrm{II}$

Now we add a third category to our model. We add the STYLE & BEAUTY category to our train set and repeat the whole process.

In [153]:
sbWords = trainSB.words.values
sbWords = list(itertools.chain.from_iterable(sbWords))
allTrainDataWords = list(map(''.join, set(itertools.chain(travelWords, BusinessWords,sbWords))))
sbWordsCount = dict(collections.Counter(sbWords))
travelWordsCount = dict(collections.Counter(travelWords))
BusinessWordsCount = dict(collections.Counter(BusinessWords))
rows = []
for word in allTrainDataWords:
    # a word missing from a category's counter has zero occurrences there
    to = travelWordsCount.get(word, 0)
    bo = BusinessWordsCount.get(word, 0)
    sbo = sbWordsCount.get(word, 0)
    rows.append({'Word' : word,'Travel occurrences' : to,'Business occurrences' : bo,'Style & Beauty occurrences' : sbo})
# build the frame once; DataFrame.append was removed in pandas 2.0
allDataFrame = pd.DataFrame(rows, columns=['Word','Travel occurrences','Business occurrences','Style & Beauty occurrences'])


We use Laplace smoothing here too.

In [154]:
alpha = 0.5
allDataFrame['Travel Probability'] = (allDataFrame['Travel occurrences'] + alpha) / (allDataFrame['Travel occurrences'].sum() + (len(set(travelWords + BusinessWords + sbWords))*alpha))
allDataFrame['Business Probability'] = (allDataFrame['Business occurrences'] + alpha) / (allDataFrame['Business occurrences'].sum() + (len(set(travelWords + BusinessWords + sbWords))*alpha))
allDataFrame['Style & Beauty Probability'] = (allDataFrame['Style & Beauty occurrences'] + alpha) / (allDataFrame['Style & Beauty occurrences'].sum() + (len(set(travelWords + BusinessWords + sbWords))*alpha))
allDataFrame = allDataFrame.set_index('Word')
allDataFrame.head()

Unnamed: 0_level_0,Travel occurrences,Business occurrences,Style & Beauty occurrences,Travel Probability,Business Probability,Style & Beauty Probability
Word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
luminous,2,0,5,2.17942e-05,4.55496e-06,5.65527e-05
refinancers,0,5,0,4.35884e-06,5.01045e-05,5.14115e-06
tack,3,0,1,3.05119e-05,4.55496e-06,1.54235e-05
puzzle,4,0,1,3.92295e-05,4.55496e-06,1.54235e-05
lineup,2,0,0,2.17942e-05,4.55496e-06,5.14115e-06


Now we can evaluate our model on our evaluation set, which contains all three categories.

In [155]:
for index,row in evaluateData.iterrows():
    travelPriorProbability = len(trainTravel)/(len(trainTravel)+len(trainBusiness)+len(trainSB))
    businessPriorProbability = len(trainBusiness)/(len(trainTravel)+len(trainBusiness)+len(trainSB))
    sbPriorProbability = len(trainSB)/(len(trainTravel)+len(trainBusiness)+len(trainSB))
    words = set(row['words'])
    for word in words:
        if word in allTrainDataWords:
            travelPriorProbability *= allDataFrame.at[word,'Travel Probability']
            businessPriorProbability *= allDataFrame.at[word,'Business Probability']
            sbPriorProbability *= allDataFrame.at[word,'Style & Beauty Probability']
    evaluateData.at[index,'Travel Probability'] = travelPriorProbability
    evaluateData.at[index,'Business Probability'] = businessPriorProbability
    evaluateData.at[index,'Style & Beauty Probability'] = sbPriorProbability
    if travelPriorProbability >= businessPriorProbability and travelPriorProbability >= sbPriorProbability:
        evaluateData.at[index,'Prediction'] = 'TRAVEL'
    if businessPriorProbability >= travelPriorProbability and businessPriorProbability >= sbPriorProbability:
        evaluateData.at[index,'Prediction'] = 'BUSINESS'
    if sbPriorProbability >= travelPriorProbability and sbPriorProbability>= businessPriorProbability:
        evaluateData.at[index,'Prediction'] = 'STYLE & BEAUTY'


In [156]:
evaluateData.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description,words,Travel Probability,Business Probability,Style & Beauty Probability,Prediction
16075,16075,,STYLE & BEAUTY,2012-08-13,Victoria Beckham Awkwardly Reunites With Spice...,https://www.huffingtonpost.com/entry/spice-gir...,But the power quintet reunited as the Girls of...,"[power, quintet, reunite, girls, spice, london...",1.170543e-43,6.798945e-47,9.233991e-42,STYLE & BEAUTY
4603,4603,"Pierre R. Berastaín, Contributor\nUndocumented...",TRAVEL,2012-09-08,Meeting Amtrak's Cross-Country Passengers,https://www.huffingtonpost.com/entry/amtrak-am...,The different social spaces within the train a...,"[different, social, space, within, train, allo...",9.062897999999999e-51,1.9889689999999998e-48,1.45222e-55,BUSINESS
12692,12692,"Richard Wiese, Contributor\nHost of Born to Ex...",TRAVEL,2012-07-31,Born to Explore: Eating Rotten Shark With A Vi...,https://www.huffingtonpost.com/entry/rotten-ic...,"I'm no food critic, but imagine eating the con...","[food, critic, imagine, eat, content, bait, bo...",1.217144e-42,8.45085e-46,1.020622e-45,TRAVEL
2580,2580,"Mike Arkus, ContributorJournalist",TRAVEL,2016-05-29,The Mediaeval Greek Fortress Town of Monemvasi...,https://www.huffingtonpost.com/entry/the-media...,Enough of 'equivalents' already. Lonely Planet...,"[enough, equivalents, already, lonely, planet,...",8.168402000000001e-106,1.078084e-119,1.815601e-117,TRAVEL
5394,5394,"BnBFinder.com, Contributor\nBnBFinder",TRAVEL,2012-09-25,Leaf Peeping From The Porch (PHOTOS),https://www.huffingtonpost.com/entry/taking-th...,"You know it's coming, but every year it still ...","[know, come, every, year, still, feel, like, c...",9.169448e-53,1.3125400000000002e-54,1.140317e-53,TRAVEL


In [169]:
#Travel
correctTravel = evaluateData.loc[(evaluateData['category'] == 'TRAVEL') & (evaluateData['Prediction'] == 'TRAVEL')]
correctTravelSum = len(correctTravel)
allOfTravel = (evaluateData['category'] == 'TRAVEL').sum()
allTravelPrediction = (evaluateData['Prediction'] == 'TRAVEL').sum()
travelRecall = correctTravelSum/allOfTravel
travelPrecision = correctTravelSum/allTravelPrediction
print(travelRecall,travelPrecision)
#Business
correctBusiness = evaluateData.loc[(evaluateData['category'] == 'BUSINESS') & (evaluateData['Prediction'] == 'BUSINESS')]
correctBusinessSum = len(correctBusiness)
allOfBusiness = (evaluateData['category'] == 'BUSINESS').sum()
allBusinessPrediction = (evaluateData['Prediction'] == 'BUSINESS').sum()
BusinessRecall = correctBusinessSum/allOfBusiness
BusinessPrecision = correctBusinessSum/allBusinessPrediction
print(BusinessRecall,BusinessPrecision)
#Style
correctSB = evaluateData.loc[(evaluateData['category'] == 'STYLE & BEAUTY') & (evaluateData['Prediction'] == 'STYLE & BEAUTY')]
correctSBSum = len(correctSB)
allOfSB = (evaluateData['category'] == 'STYLE & BEAUTY').sum()
allSBPrediction = (evaluateData['Prediction'] == 'STYLE & BEAUTY').sum()
SBRecall = correctSBSum/allOfSB
SBPrecision = correctSBSum/allSBPrediction
print(SBRecall,SBPrecision)

evaluateData['correct'] = (evaluateData['category'] == evaluateData['Prediction'])
correctDetected = (evaluateData['correct']).sum()
accuracy =  correctDetected / len(evaluateData)
print(accuracy)

0.8264044943820225 0.8652941176470588
0.9055191768007483 0.7457627118644068
0.842832469775475 0.9219143576826196
0.8510684692542521


| Phase 2 | Travel | Business | Style & Beauty |
| --- | --- | --- | --- |
| Recall | 83% | 91% | 84% |
| Precision | 87% | 75% | 92% |
| Accuracy | 85% | 85% | 85% |

### Confusion Matrix

A confusion matrix summarizes the prediction results of a classification problem. In the table below, each row corresponds to an actual class and each column corresponds to a predicted class. The confusion matrix gives us a better view of where our model makes mistakes.

In [170]:
wrongTravelB = evaluateData.loc[(evaluateData['category'] == 'TRAVEL') & (evaluateData['Prediction'] == 'BUSINESS')]
wrongTravelS = evaluateData.loc[(evaluateData['category'] == 'TRAVEL') & (evaluateData['Prediction'] == 'STYLE & BEAUTY')]
wrongBusinessT = evaluateData.loc[(evaluateData['category'] == 'BUSINESS') & (evaluateData['Prediction'] == 'TRAVEL')]
wrongBusinessS = evaluateData.loc[(evaluateData['category'] == 'BUSINESS') & (evaluateData['Prediction'] == 'STYLE & BEAUTY')]
wrongSBT = evaluateData.loc[(evaluateData['category'] == 'STYLE & BEAUTY') & (evaluateData['Prediction'] == 'TRAVEL')]
wrongSBB = evaluateData.loc[(evaluateData['category'] == 'STYLE & BEAUTY') & (evaluateData['Prediction'] == 'BUSINESS')]
print('TRAVEL: ',len(correctTravel),len(wrongTravelB),len(wrongTravelS))
print('BUSINESS: ',len(correctBusiness),len(wrongBusinessT),len(wrongBusinessS))
print('STYLE & BEAUTY: ',len(correctSB),len(wrongSBT),len(wrongSBB))

TRAVEL:  1471 221 88
BUSINESS:  968 65 36
STYLE & BEAUTY:  1464 164 109


| Actual / Predicted | Travel | Business | Style & Beauty |
| --- | --- | --- | --- |
| Travel | 1471 | 221 | 88 |
| Business | 65 | 968 | 36 |
| Style & Beauty   | 164 | 109 | 1464 |
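As a consistency check, the recall, precision and accuracy printed earlier can be recomputed from this table alone. Reading each row as an actual class and each column as a predicted class reproduces those values; the variable names below are chosen for illustration.

```python
labels = ["TRAVEL", "BUSINESS", "STYLE & BEAUTY"]
# Rows: actual class, columns: predicted class (values copied from the table above).
matrix = [
    [1471, 221, 88],
    [65, 968, 36],
    [164, 109, 1464],
]

metrics = {}
for i, label in enumerate(labels):
    true_positive = matrix[i][i]
    recall = true_positive / sum(matrix[i])                    # row total = actual count
    precision = true_positive / sum(row[i] for row in matrix)  # column total = predicted count
    metrics[label] = (recall, precision)
    print(f"{label}: recall={recall:.4f} precision={precision:.4f}")

accuracy = sum(matrix[i][i] for i in range(len(labels))) / sum(map(sum, matrix))
print(f"accuracy={accuracy:.4f}")
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total number of evaluated items.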

## Sampling

If we check the recall and precision for each category, we see that these values differ. This is because the categories are not equally represented in our training data: some classes have more examples than others. To overcome this problem we have two options: randomly duplicate examples of the minority class (oversampling) or randomly delete examples of the majority class (undersampling). Either way we end up with a more balanced data set. In this notebook we applied the first option to the BUSINESS train set.
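Both options can be sketched in a few lines of plain Python. The helpers `oversample` and `undersample` and the toy item names are hypothetical; the notebook itself resamples with `DataFrame.sample(..., replace=True)`.

```python
import random

def oversample(minority, target_size, seed=0):
    """Randomly duplicate minority examples until the class has target_size items."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def undersample(majority, target_size, seed=0):
    """Randomly keep only target_size distinct examples of the majority class."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

# Toy classes of unequal size.
travel = [f"travel_{i}" for i in range(10)]
business = [f"business_{i}" for i in range(4)]

balanced_up = oversample(business, len(travel))      # grow BUSINESS to 10 items
balanced_down = undersample(travel, len(business))   # shrink TRAVEL to 4 items
print(len(balanced_up), len(balanced_down))
```

Oversampling keeps all the original information but repeats some examples, while undersampling discards data; which trade-off is better depends on how much data each class has.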

## Final Evaluation 

We have a dataset (test.csv) that is similar to our train dataset but has no category labels. We want to predict the category of each news item in this dataset with our trained model.

In [159]:
testDataFrame = pd.read_csv('Attachment/test.csv')
testDataFrame = cleanText(testDataFrame)
testDataFrame = testDataFrame.dropna(subset=['short_description'])
for index,row in testDataFrame.iterrows():
    travelPriorProbability = len(trainTravel)/(len(trainTravel)+len(trainBusiness)+len(trainSB))
    businessPriorProbability = len(trainBusiness)/(len(trainTravel)+len(trainBusiness)+len(trainSB))
    sbPriorProbability = len(trainSB)/(len(trainTravel)+len(trainBusiness)+len(trainSB))
    words = set(row['words'])
    for word in words:
        if word in allTrainDataWords:
            travelPriorProbability *= allDataFrame.at[word,'Travel Probability']
            businessPriorProbability *= allDataFrame.at[word,'Business Probability']
            sbPriorProbability *= allDataFrame.at[word,'Style & Beauty Probability']
    testDataFrame.at[index,'Travel Probability'] = travelPriorProbability
    testDataFrame.at[index,'Business Probability'] = businessPriorProbability
    testDataFrame.at[index,'Style & Beauty Probability'] = sbPriorProbability
    if travelPriorProbability >= businessPriorProbability and travelPriorProbability >= sbPriorProbability:
        testDataFrame.at[index,'category'] = 'TRAVEL'
    if businessPriorProbability >= travelPriorProbability and businessPriorProbability >= sbPriorProbability:
        testDataFrame.at[index,'category'] = 'BUSINESS'
    if sbPriorProbability >= travelPriorProbability and sbPriorProbability>= businessPriorProbability:
        testDataFrame.at[index,'category'] = 'STYLE & BEAUTY'

Now we write the category predictions to the output.csv file.

In [160]:
output = pd.DataFrame({"index": testDataFrame['index'],"category": testDataFrame['category']})
output.to_csv('output.csv', index=False)
output

Unnamed: 0,index,category
0,0,STYLE & BEAUTY
1,1,TRAVEL
4,4,STYLE & BEAUTY
5,5,TRAVEL
6,6,STYLE & BEAUTY
...,...,...
2543,2543,BUSINESS
2544,2544,TRAVEL
2545,2545,BUSINESS
2546,2546,TRAVEL


## Questions

### 1

Stemming reduces inflected words to a root form (stem) by stripping affixes, mapping a group of words to the same stem even if the stem itself is not a valid word in the language. Lemmatization, on the other hand, reduces inflected words so that the resulting root belongs to the language. The result of lemmatization is called a lemma, which is the dictionary form (citation form) of the word.
We use lemmatization in our project because we are interested in word frequency and want to treat all forms of a word as one. With stemming, some forms of a word may not share the same stem, which can distort the frequency of that word and therefore the accuracy of our model.
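The difference can be illustrated with a deliberately crude toy. `toy_stem` is a hypothetical suffix stripper, not the Porter algorithm, and `TOY_LEMMAS` is a hand-written lookup table standing in for a real lemmatizer's dictionary; real tools such as nltk's `PorterStemmer` and `WordNetLemmatizer` behave differently in detail.

```python
def toy_stem(word):
    """Crude suffix stripping; the result need not be a real word."""
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

# Hypothetical lookup table; a real lemmatizer consults a full dictionary.
TOY_LEMMAS = {"studies": "study", "studying": "study", "went": "go", "going": "go"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "studying", "went"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

In this toy, the two forms of "study" end up with different stems, and the irregular form "went" is only resolved by the dictionary lookup.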

### 2

tf-idf is short for "term frequency-inverse document frequency"; it reflects how important a word is to a document in a collection.
tf measures how frequently a term occurs in a document.

$$ TF(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d} $$

idf measures how important a term is across the collection: a term that appears in few documents gets a higher weight.

$$ IDF(t) = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right) $$

In our Bayesian model, when we calculate the probabilities from word occurrences, each occurrence of a word in a document (in our case a news item) counts as one. To use tf-idf, we would replace this count of one with the word's tf-idf weight.
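The two formulas can be tried on a toy collection. The three token lists below are invented, and the helpers `tf`, `idf` and `tf_idf` are illustrative names, with `idf` assuming the term occurs in at least one document.

```python
import math

def tf(term, document):
    """Share of the document's tokens that are this term."""
    return document.count(term) / len(document)

def idf(term, documents):
    """Natural log of (number of documents / documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)  # assumes containing > 0

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

# Toy collection of tokenized documents.
documents = [
    ["hotel", "beach", "hotel"],
    ["market", "stock", "hotel"],
    ["market", "bank"],
]

first = documents[0]
print(tf_idf("hotel", first, documents), tf_idf("beach", first, documents))
```

Although "hotel" occurs twice in the first document and "beach" only once, "beach" receives the larger weight because it appears in fewer documents.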

### 3

We can get 100 percent precision for a certain category even if our model predicts the categories poorly. For example, if we predict exactly one news item as TRAVEL and it is correct, and assign all other news items to other categories (e.g. BUSINESS), then the precision of TRAVEL is 100 percent, although most TRAVEL news is misclassified and the recall of TRAVEL is very low.
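The scenario can be reproduced with ten made-up labels: five TRAVEL items and five BUSINESS items, of which only the first is predicted as TRAVEL.

```python
actual = ["TRAVEL"] * 5 + ["BUSINESS"] * 5
predicted = ["TRAVEL"] + ["BUSINESS"] * 9   # only the first item is labeled TRAVEL

true_positive = sum(a == "TRAVEL" and p == "TRAVEL" for a, p in zip(actual, predicted))
precision = true_positive / predicted.count("TRAVEL")   # correct among predicted TRAVEL
recall = true_positive / actual.count("TRAVEL")         # correct among actual TRAVEL
print(precision, recall)  # 1.0 0.2
```

Precision alone looks perfect here, which is why it should be read together with recall.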

### 4

We discussed this problem earlier, along with Laplace smoothing as its solution.