# Artificial Intelligence - Computer Assignment 3 - Bayesian Network

## Mehrdad Nourbakhsh(810194418)

In this project, we want to use Bayes rule for classifying news based on a short description of that news.

In [1]:
import numpy as np
import pandas as pd
import itertools
import collections
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

We have a dataset with nearly 23000 news. each news has a category. we want to create a model to classify the category of that news based on the short description of that news. we read the dataset from .csv file and load that into a panda dataframe.

In [2]:
dataFrame = pd.read_csv('Attachment/data.csv')
dataFrame.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description
0,0,"Katherine LaGrave, ContributorTravel writer an...",TRAVEL,2014-05-07,"EccentriCities: Bingo Parties, Paella and Isla...",https://www.huffingtonpost.com/entry/eccentric...,Påskekrim is merely the tip of the proverbial ...
1,1,Ben Hallman,BUSINESS,2014-06-09,Lawyers Are Now The Driving Force Behind Mortg...,https://www.huffingtonpost.com/entry/mortgage-...,
2,2,Jessica Misener,STYLE & BEAUTY,2012-03-12,Madonna 'Truth Or Dare' Shoe Line To Debut Thi...,https://www.huffingtonpost.com/entry/madonna-s...,"Madonna is slinking her way into footwear now,..."
3,3,"Victor and Mary, Contributor\n2Sense-LA.com",TRAVEL,2013-12-17,Sophistication and Serenity on the Las Vegas S...,https://www.huffingtonpost.com/entry/las-vegas...,But what if you're a 30-something couple that ...
4,4,"Emily Cohn, Contributor",BUSINESS,2015-03-19,It's Still Pretty Hard For Women To Get Free B...,https://www.huffingtonpost.com/entry/free-birt...,Obamacare was supposed to make birth control f...


We can see that each news has some other information too rather than the short description (e.g headline, authors). but we ignore them and use only words of the short description to train our model.

First, we should preprocess the data and use the result for the model. we want to normalize the short description.
we use nltk library for normalizing the text.normalization include these steps:
* converting uppercase to lowercase
* removing punctuation signs
* removing numbers
* converting each text into a list of words
* removing stop words
* using lemmatization to remove inflectional endings only and to return the base form of a word

If we don't convert all words to lowercase, our model might treat a word which is at the beginning of a sentence with a capital letter, different from the same word which appears later in the sentence but without any capital latter. this might influence our model accuracy. thus we convert all letters to lowercase.

In our model, we use words frequency and occurrences of them in the text. we want to find relevant results not only for the exact expression but also for the other possible forms of the words we used. for this purpose we use lemmatization. lemmatization helps us to treat all possible forms of a word as an individual word and this can improve our model with increasing word frequency for each category.

In [3]:
data = dataFrame.short_description
data = data.fillna('')
data = data.str.lower()
data = data.str.replace('[^\w\s]',' ')
data = data.str.replace('\d+', '')
stopWords = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
wordsList = data.apply(word_tokenize)
wordsList = wordsList.apply(lambda x: [item for item in x if item not in stopWords])
wordsList = wordsList.apply(lambda x: [lemmatizer.lemmatize(y, pos="v") for y in x])
dataFrame['words'] = wordsList
dataFrame.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description,words
0,0,"Katherine LaGrave, ContributorTravel writer an...",TRAVEL,2014-05-07,"EccentriCities: Bingo Parties, Paella and Isla...",https://www.huffingtonpost.com/entry/eccentric...,Påskekrim is merely the tip of the proverbial ...,"[påskekrim, merely, tip, proverbial, iceberg, ..."
1,1,Ben Hallman,BUSINESS,2014-06-09,Lawyers Are Now The Driving Force Behind Mortg...,https://www.huffingtonpost.com/entry/mortgage-...,,[]
2,2,Jessica Misener,STYLE & BEAUTY,2012-03-12,Madonna 'Truth Or Dare' Shoe Line To Debut Thi...,https://www.huffingtonpost.com/entry/madonna-s...,"Madonna is slinking her way into footwear now,...","[madonna, slink, way, footwear, truth, dare, p..."
3,3,"Victor and Mary, Contributor\n2Sense-LA.com",TRAVEL,2013-12-17,Sophistication and Serenity on the Las Vegas S...,https://www.huffingtonpost.com/entry/las-vegas...,But what if you're a 30-something couple that ...,"[something, couple, shy, away, table, dance, e..."
4,4,"Emily Cohn, Contributor",BUSINESS,2015-03-19,It's Still Pretty Hard For Women To Get Free B...,https://www.huffingtonpost.com/entry/free-birt...,Obamacare was supposed to make birth control f...,"[obamacare, suppose, make, birth, control, fre..."


We need to separate our data into two parts. the train set and evaluate set. we use 80 percent of data as train set and 20 percent as evaluate set. we want to have all sorts of news. if we choose 80 percent of real data as train set, we may have a set with only one or two categories, therefore, our model can not detect other categories as well. we have to create a train set with good diversity from all of the categories. 

for each category, we choose a random subset of that category (80 percent) and then combine those subsets to create our train set and the remaining 20 percent of each category for evaluate set.

In [4]:
travel = dataFrame[dataFrame['category'] == 'TRAVEL']
business = dataFrame[dataFrame['category'] == 'BUSINESS']
sb = dataFrame[dataFrame['category'] == 'STYLE & BEAUTY']
trainTravel = travel.sample(frac=0.8)
evaluateTravel = travel.drop(trainTravel.index)
trainBusiness = business.sample(frac=0.8)
evaluateBusiness = business.drop(trainBusiness.index)
trainSB = sb.sample(frac=0.8)
evaluateSB = sb.drop(trainSB.index)
# trainData = pd.concat([trainTravel,trainBusiness, trainSB])
# trainData = trainData.sample(frac=1)
evaluateData = pd.concat([evaluateSB,evaluateBusiness, evaluateTravel])
evaluateData = evaluateData.sample(frac=1)
# trainData.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description,words
6155,6155,"Dr. Patty Ann Tublin, ContributorRelationship ...",BUSINESS,2015-03-06,'Paleo-ing' Your Business and Career Is the Ke...,https://www.huffingtonpost.com/entry/paleoing-...,So what could paleo eating and success in your...,"[could, paleo, eat, success, business, career,..."
20153,20153,Suzy Strutner,TRAVEL,2014-10-05,50 Fictional Places You Can Visit In Real Life,https://www.huffingtonpost.com/entry/movie-loc...,,[]
15758,15758,,TRAVEL,2012-07-19,"Marrakech, Morocco Sees Hotel Boom (PHOTOS)",https://www.huffingtonpost.com/entry/marrakech...,Marrakech has been having a moment (in the tra...,"[marrakech, moment, travel, press, least, sinc..."
21273,21273,"Ann Francke, Contributor\nManagement Expert an...",BUSINESS,2013-03-02,Why Marissa Mayer Makes Me Mad as a Mom and a ...,https://www.huffingtonpost.com/entry/why-maris...,"Mayer's decision, ironically, is a huge diss o...","[mayer, decision, ironically, huge, diss, prod..."
647,647,Michelle Persad,STYLE & BEAUTY,2012-11-17,Isla Fisher Channels Old Hollywood Glamour (PH...,https://www.huffingtonpost.com/entry/isla-fish...,Red lipstick never looked so good.,"[red, lipstick, never, look, good]"


After creating our train set, we should train our model with this set. 
for this purpose, we use Bayes rule.

$$ P(c|x) = \frac{P(x|c)\times P(c)}{P(x)} $$

As we can see, Bayes rule has 4 parts: Posterior, Likelihood, Prior and Evidence.in order to use this rule, we should define each part of the Bayes rule for our project.

The posterior probability is the probability of a category given the words in the news. we use Bayes rule to calculate this probability.

$$ posterior = {P(category | x_0, x_1, x_2, ...,  x_n)} $$

which is $x_n$ is the n-th word of that news.

We define the probability of each category as the prior probability which means how probable is it for a news to be for a certain category.

We define the likelihood as the probability of each word of a news given the category which means how probable it is for that category to use that word. in other words, the likelihood probability is:

$$ likelihood = {P(x_0, x_1, x_2, ...,  x_n|category)} $$

Since the probability of existing a word in a certain category is independent of the probability of existing another word in that category for each news, we can multiply these probabilities to calculate our conditional probability.

$$ likelihood = {P(x_0|category)\times P(x_1|category) \times P(x_2|category) \times ... \times P(x_n|category)} $$

The evidence is the probabiliy of all words that we have in a given news.

$$ evidence = {P(x_0, x_1, x_2, ...,  x_n)} $$

Since we are going to compare the posterior probabilities for each category and in each category the evidence probability is the same as other categories, thus we don't need to calculate the evidence probability.



##  Part $\mathrm{I}$

In this part, we want to train our model for only two categories. TRAVEL category and BUSINESS category.

In [5]:
travelWords = trainTravel.words.values
travelWords = list(itertools.chain.from_iterable(travelWords))
BusinessWords = trainBusiness.words.values
BusinessWords = list(itertools.chain.from_iterable(BusinessWords))
allTravelAndBusinessWord = list(map(''.join, set(itertools.chain(travelWords, BusinessWords))))
travelWordsCount = dict(collections.Counter(travelWords))
BusinessWordsCount = dict(collections.Counter(BusinessWords))
newDataFrame = pd.DataFrame(columns=['Word','Travel occurrences','Business occurrences'])
for word in allTravelAndBusinessWord:
    to = 0
    bo = 0
    if word in travelWordsCount:
        to = travelWordsCount[word]
    if word in BusinessWordsCount:
        bo = BusinessWordsCount[word]
    newDataFrame = newDataFrame.append({'Word' : word,'Travel occurrences' : to,'Business occurrences' : bo},ignore_index=True)


We create a new data frame. for each word, we calculate the probability of that word given the category. in fact, we calculate $ P(word | category) $ for each word in our training set (in this case out training set contain only two mentioned categories).


For calculating the conditional probability we use the Laplace Smoothing. since we multiply the conditional probabilities in order to calculate the likelihood, if we have a word that used only one time in one category, the likelihood probability for other categories will be zero even if other conditional probabilities have a high value. Laplace smoothing is used to solve the problem of zero probability.

$$ P(word|category) =  \frac{O(word,category) + \alpha}{S(words,categor) + |A|\alpha} $$

We use this formula to calculate the conditional probability. O(word, category) is the word occurrences in that category, S(words, category) is the number of all words in that category and |A| is the number of distinct words in our train set.
alpha is a constant which is used to solve zero probability problem.

In [6]:
alpha = 0.5
newDataFrame['Travel Probability'] = (newDataFrame['Travel occurrences'] + alpha) / (newDataFrame['Travel occurrences'].sum() + (len(set(travelWords + BusinessWords))*alpha))
newDataFrame['Business Probability'] = (newDataFrame['Business occurrences'] + alpha) / (newDataFrame['Business occurrences'].sum() + (len(set(travelWords + BusinessWords))*alpha))
newDataFrame = newDataFrame.set_index('Word')
newDataFrame

Unnamed: 0_level_0,Travel occurrences,Business occurrences,Travel Probability,Business Probability
Word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
farris,0,1,4.46652e-06,2.45833e-05
entrepreneurship,0,12,4.46652e-06,0.000204861
walmart,0,33,4.46652e-06,0.000549027
martyr,2,0,2.23326e-05,8.19444e-06
shaft,0,1,4.46652e-06,2.45833e-05
...,...,...,...,...
frequencies,1,0,1.33996e-05,8.19444e-06
bicycle,8,0,7.59308e-05,8.19444e-06
carefully,2,3,2.23326e-05,5.73611e-05
file,7,19,6.69978e-05,0.000319583


In [7]:
travelAndBusinessEvaluateData = pd.concat([evaluateBusiness, evaluateTravel])
travelAndBusinessEvaluateData = travelAndBusinessEvaluateData.sample(frac=1)


After calculating the conditional probability our model is ready to test with evaluate data. for each news in evaluate set, we calculate the prior probability and for all of the words in that news, we multiply the conditional probability with prior probability in order to calculate the posterior probability. After that, we can predict the category for each news. the category with higher posterior probability is our model prediction.

In [8]:
for index,row in travelAndBusinessEvaluateData.iterrows():
    travelPriorProbability = len(trainTravel)/(len(trainTravel)+len(trainBusiness))
    businessPriorProbability = len(trainBusiness)/(len(trainTravel)+len(trainBusiness))
    words = set(row['words'])
    for word in words:
        if word in allTravelAndBusinessWord:
            travelPriorProbability *= newDataFrame.at[word,'Travel Probability']
            businessPriorProbability *= newDataFrame.at[word,'Business Probability']

    travelAndBusinessEvaluateData.at[index,'Travel Probability'] = travelPriorProbability
    travelAndBusinessEvaluateData.at[index,'Business Probability'] = businessPriorProbability
    if travelPriorProbability >= businessPriorProbability:
        travelAndBusinessEvaluateData.at[index,'Prediction'] = 'TRAVEL'
    else:
        travelAndBusinessEvaluateData.at[index,'Prediction'] = 'BUSINESS'
travelAndBusinessEvaluateData

Unnamed: 0,index,authors,category,date,headline,link,short_description,words,Travel Probability,Business Probability,Prediction
15144,15144,"Kari Astrid Haugeto, Contributor\nEntrepreneur...",TRAVEL,2013-08-04,Would You Take Your Kids to Las Vegas?,https://www.huffingtonpost.com/entry/las-vegas...,Vegas may not be considered the most family-fr...,"[vegas, may, consider, family, friendly, desti...",1.310353e-60,2.234789e-67,TRAVEL
10360,10360,,BUSINESS,2013-07-05,Paul Krugman: Here's 1 Thing That Hasn't Chang...,https://www.huffingtonpost.comhttp://www.nytim...,It's that time of year -- the long weekend whe...,"[time, year, long, weekend, gather, friends, f...",1.381914e-41,1.804279e-44,TRAVEL
21814,21814,"Anne Z. Cooke, ContributorTravel and adventure...",TRAVEL,2014-12-10,Mexico's No. 1 Baja Beach Resort: The Villa de...,https://www.huffingtonpost.com/entry/post_8715...,,[],6.248244e-01,3.751756e-01,TRAVEL
22644,22644,"Oyster.com, Contributor\nThe Hotel Tell-All",TRAVEL,2013-02-09,Trash Your Room At These Rock And Roll Hotels ...,https://www.huffingtonpost.com/entry/unleash-y...,"If you want to drop a beat, or just pick up th...","[want, drop, beat, pick, vibes, leave, behind,...",1.280896e-44,5.575987e-51,TRAVEL
13702,13702,"Christopher Elliott, Contributor\nAuthor, How ...",TRAVEL,2012-08-21,5 Things Not To Tell A TSA Screener (VIDEO),https://www.huffingtonpost.com/entry/5-things-...,"Dressing down a TSA agent at the airport, whil...","[dress, tsa, agent, airport, tempt, serve, use...",1.045361e-31,3.895727e-36,TRAVEL
...,...,...,...,...,...,...,...,...,...,...,...
20336,20336,"Tom Mulhall, Contributor\nNude Recreation Spec...",TRAVEL,2013-08-29,Is Topless Sunbathing Good for American Tourism?,https://www.huffingtonpost.com/entry/america-t...,Can American tourism afford not to allow tople...,"[american, tourism, afford, allow, topless, su...",2.719939e-23,9.349226e-27,TRAVEL
12135,12135,Suzy Strutner,TRAVEL,2013-09-25,6 Beautiful Canals You'll Probably Want To Vis...,https://www.huffingtonpost.com/entry/beautiful...,"2. Monmouthshire and Brecon, South Wales Peopl...","[monmouthshire, brecon, south, wales, people, ...",1.482118e-38,5.150690e-41,TRAVEL
2616,2616,"Peter Mandel, Contributor\nWashington Post con...",TRAVEL,2013-04-01,British Tourists Bitch About New York: Shoppin...,https://www.huffingtonpost.com/entry/british-t...,What do tourists from abroad think of travelin...,"[tourists, abroad, think, travel, keen, americ...",1.803517e-33,2.389699e-42,TRAVEL
652,652,,BUSINESS,2012-05-20,Is Insider Trading Part Of The Fabric On Wall ...,https://www.huffingtonpost.comhttp://www.nytim...,Federal authorities today are trumpeting effor...,"[federal, authorities, today, trumpet, efforts...",5.493070e-49,5.520852e-47,BUSINESS


In [9]:
#Travel
correctTravel = travelAndBusinessEvaluateData.loc[(travelAndBusinessEvaluateData['category'] == 'TRAVEL') & (travelAndBusinessEvaluateData['Prediction'] == 'TRAVEL')]
correctTravel = (correctTravel.all(axis='columns')).sum()
allOfTravel = (travelAndBusinessEvaluateData['category'] == 'TRAVEL').sum()
allTravelPrediction = (travelAndBusinessEvaluateData['Prediction'] == 'TRAVEL').sum()
travelRecall = correctTravel/allOfTravel
travelPrecision = correctTravel/allTravelPrediction
print(travelRecall,travelPrecision)
#Business
correctBusiness = travelAndBusinessEvaluateData.loc[(travelAndBusinessEvaluateData['category'] == 'BUSINESS') & (travelAndBusinessEvaluateData['Prediction'] == 'BUSINESS')]
correctBusiness = (correctBusiness.all(axis='columns')).sum()
allOfBusiness = (travelAndBusinessEvaluateData['category'] == 'BUSINESS').sum()
allBusinessPrediction = (travelAndBusinessEvaluateData['Prediction'] == 'BUSINESS').sum()
BusinessRecall = correctBusiness/allOfBusiness
BusinessPrecision = correctBusiness/allBusinessPrediction
print(BusinessRecall,BusinessPrecision)
travelAndBusinessEvaluateData['correct'] = (travelAndBusinessEvaluateData['category'] == travelAndBusinessEvaluateData['Prediction'])
correctDetected = (travelAndBusinessEvaluateData['correct']).sum()
accuracy =  correctDetected / len(travelAndBusinessEvaluateData)
print(accuracy)

0.8758426966292134 0.8128258602711157
0.7427502338634238 0.8528464017185822
0.8553878553878553


##  Part $\mathrm{II}$

Now we want to add third category to our model. we add STYLE & BEAUTY category to our train set and repeat all the process again. 

In [10]:
# sbWords = trainSB.words.values
# sbWords = list(itertools.chain.from_iterable(sbWords))
# allTrainDataWords = list(map(''.join, set(itertools.chain(travelWords, BusinessWords,sbWords))))
# print(len(travelWords),len(BusinessWords),len(sbWords),len(allTrainDataWords))
# sbWordsCount = dict(collections.Counter(sbWords))
# travelWordsCount = dict(collections.Counter(travelWords))
# BusinessWordsCount = dict(collections.Counter(BusinessWords))
# print(len(travelWordsCount),len(BusinessWordsCount),len(sbWordsCount),len(allTrainDataWords))
# allDataFrame = pd.DataFrame(columns=['Word','Travel occurrences','Business occurrences','Style & Beauty occurrences'])
# for word in allTrainDataWords:
#     to = 0
#     sbo = 0
#     bo = 0
#     if word in travelWordsCount:
#         to = travelWordsCount[word]
#     if word in BusinessWordsCount:
#         bo = BusinessWordsCount[word]
#     if word in sbWordsCount:
#         sbo = sbWordsCount[word]
#     allDataFrame = allDataFrame.append({'Word' : word,'Travel occurrences' : to,'Business occurrences' : bo,'Style & Beauty occurrences' : sbo},ignore_index=True)


We also use Laplace smoothing here too.

In [11]:
# alpha = 0.5
# allDataFrame['Travel Probability'] = (allDataFrame['Travel occurrences'] + alpha) / (allDataFrame['Travel occurrences'].sum() + (len(set(travelWords + BusinessWords + sbWords))*alpha))
# allDataFrame['Business Probability'] = allDataFrame['Business occurrences'] / (allDataFrame['Business occurrences'].sum() + (len(set(travelWords + BusinessWords + sbWords))*alpha))
# allDataFrame['Style & Beauty Probability'] = allDataFrame['Style & Beauty occurrences'] / (allDataFrame['Style & Beauty occurrences'].sum() + (len(set(travelWords + BusinessWords + sbWords))*alpha))
# allDataFrame = allDataFrame.set_index('Word')
# allDataFrame

Now we can evaluate our model with our evaluation set which has all three category.

In [12]:
# for index,row in evaluateData.iterrows():
#     travelPriorProbability = len(trainTravel)/(len(trainTravel)+len(trainBusiness)+len(trainSB))
#     businessPriorProbability = len(trainBusiness)/(len(trainTravel)+len(trainBusiness)+len(trainSB))
#     sbPriorProbability = len(trainSB)/(len(trainTravel)+len(trainBusiness)+len(trainSB))
#     words = set(row['words'])
#     for word in words:
#         if word in allTrainDataWords:
#             travelPriorProbability *= allDataFrame.at[word,'Travel Probability']
#             businessPriorProbability *= allDataFrame.at[word,'Business Probability']
#             sbPriorProbability *= allDataFrame.at[word,'Style & Beauty Probability']
#     evaluateData.at[index,'Travel Probability'] = travelPriorProbability
#     evaluateData.at[index,'Business Probability'] = businessPriorProbability
#     evaluateData.at[index,'Style & Beauty Probability'] = sbPriorProbability
#     if travelPriorProbability >= businessPriorProbability and travelPriorProbability >= sbPriorProbability:
#         evaluateData.at[index,'Prediction'] = 'TRAVEL'
#     if businessPriorProbability >= travelPriorProbability and businessPriorProbability >= sbPriorProbability:
#         evaluateData.at[index,'Prediction'] = 'BUSINESS'
#     if sbPriorProbability >= travelPriorProbability and sbPriorProbability>= businessPriorProbability:
#         evaluateData.at[index,'Prediction'] = 'STYLE & BEAUTY'


In [13]:
evaluateData.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description,words
4071,4071,"Travel + Leisure, Contributor\nTravelandLeisur...",TRAVEL,2012-06-28,Europe's Secret Hot Spots (PHOTOS),https://www.huffingtonpost.com/entry/europes-s...,The continent is so varied that even with 17 c...,"[continent, vary, even, countries, share, euro..."
14977,14977,,STYLE & BEAUTY,2013-10-01,11 Fashion Essentials Every 30-Something Shoul...,https://www.huffingtonpost.com/entry/fashion-e...,So you've made it past your trying twenties. Y...,"[make, past, try, twenties, hopefully, invest,..."
16709,16709,Brittany Binowski,TRAVEL,2013-08-04,Your Weekly Travel Zen: Harbors,https://www.huffingtonpost.com/entry/harbor-ph...,Where have you traveled for a moment of zen? E...,"[travel, moment, zen, email, travel, huffingto..."
7382,7382,Abigail Williams,TRAVEL,2016-09-26,West Elm Is Launching Its Own Collection Of Ho...,https://www.huffingtonpost.com/entry/west-elm-...,This is not a drill.,[drill]
12662,12662,"Mike Sternoff, Contributor\nDigital Journalist",TRAVEL,2012-12-10,The Running Of The Bulls Made Simple (VIDEO),https://www.huffingtonpost.com/entry/running-o...,A few years back I packed up everything I owne...,"[years, back, pack, everything, own, head, eur..."


In [14]:
# evaluateData['Correct'] = (evaluateData['category'] == evaluateData['Prediction'])
# correctDetected = (evaluateData['Correct']).sum()
# accuracy =  correctDetected / len(evaluateData)
# AllOfTravel = (evaluateData['category'] == 'TRAVEL').sum()
# AllOfBusiness = (evaluateData['category'] == 'BUSINESS').sum()
# AllOfSB = (evaluateData['category'] == 'STYLE & BEAUTY').sum()
# correctTravel = (evaluateData.loc[(evaluateData['Correct'] == True) & (dataFrame['category'] == 'TRAVEL')].all(axis='columns')).sum()
# correctBusiness = (evaluateData.loc[(evaluateData['Correct'] == True) & (dataFrame['category'] == 'BUSINESS')].all(axis='columns')).sum()
# correctSB = (evaluateData.loc[(evaluateData['Correct'] == True) & (dataFrame['category'] == 'STYLE & BEAUTY')].all(axis='columns')).sum()
# AllTravelDetected = (evaluateData['Prediction'] == 'TRAVEL').sum()
# AllSBDetected = (evaluateData['Prediction'] == 'STYLE & BEAUTY').sum()
# AllBusinessDetected = (evaluateData['Prediction'] == 'BUSINESS').sum()
# recall1 = (correctTravel) / (AllOfTravel)
# recall2 = correctBusiness / AllOfBusiness
# recall3 = correctSB / AllOfSB
# recall = recall1 + recall2 + recall3
# precision1 = (correctTravel) / (AllTravelDetected)
# precision2 = correctBusiness / AllBusinessDetected
# precision3 = correctSB / AllSBDetected
# precision = precision1 + precision2 + precision3
# print(accuracy)
# print(recall)
# print(precision)

## Questions

### 1

Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language. On the other hand, Lemmatization reduces the inflected words properly ensuring that the root word belongs to the language. The result of Lemmatization is called lemma which is dictionary form, or citation form of the words. 
We use Lemmatization in our project because we interested in word frequency and want to treat all forms of a word as one. in stemming some forms of a word may not have the same stem and that can influence the frequency of that word and the accuracy of our model as well. 

### 2

tf-idf is short for “term frequency-inverse document frequency. which basically reflects how important a word is to a document.
tf measures how frequently a term occurs in a document.
TF = (Number of times term t appears in a document) / (Total number of terms in the document)
idf measures how important a term is.
IDF = $log_e$(Total number of documents / Number of documents with term t in it)

