# Naive Bayes Chat Analysis

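Naive Bayes starts from Bayes' rule: P(person | words) is proportional to P(words | person) * P(person). The denominator P(words) is the same for every person, so it can be dropped when comparing people.
We then assume the words are conditionally independent given the person, so we estimate P(person) and multiply it by P(word | person) for each word in the message.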
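
Written out for a message made of words $w_1, \dots, w_n$, the rule above can be sketched as follows (the symbols $w_i$ and $n$ are notation introduced here, not taken from the notebook):

$$
P(\text{person} \mid w_1, \dots, w_n) \propto P(\text{person}) \prod_{i=1}^{n} P(w_i \mid \text{person})
$$

The predicted sender is the person with the larger value of the right-hand side.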

The only purpose of this model is to tag every nonsense message as sent by Santrupti.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
dataset = pd.read_json("messages.json")
messages = pd.json_normalize((dataset["messages"]))
messages.describe()

statistic,id,width,height,reply_to_message_id,duration_seconds,self_destruct_period_seconds,message_id,live_location_period_seconds,location_information.latitude,location_information.longitude
count,171354.0,2528.0,2528.0,27215.0,690.0,118.0,10.0,3.0,3.0,3.0
mean,574491.022772,696.777294,723.730617,584502.341172,20.047826,19.228814,562204.5,14928.666667,17.575567,80.134621
std,50472.92013,439.615218,461.674963,39936.704935,369.475884,12.262382,26920.178377,12790.968116,0.14402,2.711292
min,486835.0,94.0,32.0,487333.0,1.0,7.0,538688.0,3600.0,17.451995,78.56378
25%,530848.25,512.0,512.0,551050.5,2.0,10.0,548692.5,7993.0,17.496484,78.569258
50%,574771.5,512.0,512.0,580205.0,3.0,10.0,549987.0,12386.0,17.540973,78.574737
75%,618150.75,960.0,1236.5,617017.5,3.0,30.0,562516.75,20593.0,17.637353,80.920041
max,662150.0,4624.0,5148.0,662140.0,9679.0,40.0,612217.0,28800.0,17.733733,83.265345
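
To see where dotted column names such as `location_information.latitude` come from, here is a minimal sketch of `pd.json_normalize` on made-up records. The values below are invented for illustration and are not taken from `messages.json`.

```python
import pandas as pd

# Hypothetical records shaped like a chat export; not taken from messages.json
records = [
    {"id": 1, "text": "hi", "from": "A"},
    {"id": 2, "text": "", "from": "B",
     "location_information": {"latitude": 17.5, "longitude": 78.5}},
]

flat = pd.json_normalize(records)
# Nested dict keys are flattened into columns joined with "."
print(sorted(flat.columns))
```

Records that lack the nested field get NaN in the flattened columns, which is why `describe()` reports such small counts for the location columns.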


In [3]:
data = messages[['text', 'from']]
print(data)

                    text                   from
0                                           NaN
1                     hi   Divyateja Pasupuleti
2              I'm sorry   Divyateja Pasupuleti
3       once upon a time   Divyateja Pasupuleti
4                  Lolol  Santruptiii BH Behera
...                  ...                    ...
171349         My cg bad  Santruptiii BH Behera
171350   Cant enjot life  Santruptiii BH Behera
171351                tf   Divyateja Pasupuleti
171352         wait what   Divyateja Pasupuleti
171353      seriously ah   Divyateja Pasupuleti

[171354 rows x 2 columns]


In [4]:
# assign() returns a new DataFrame, so nothing is written into a slice of `messages`
data = data.assign(text=data['text'].str.lower())


In [5]:
# assign() returns a new DataFrame instead of writing into a slice of `messages`;
# factorize() encodes each sender as an integer (-1 for a missing sender)
data = data.assign(person=pd.factorize(data['from'])[0])
print(data)

                    text                   from  person
0                                           NaN      -1
1                     hi   Divyateja Pasupuleti       0
2              i'm sorry   Divyateja Pasupuleti       0
3       once upon a time   Divyateja Pasupuleti       0
4                  lolol  Santruptiii BH Behera       1
...                  ...                    ...     ...
171349         my cg bad  Santruptiii BH Behera       1
171350   cant enjot life  Santruptiii BH Behera       1
171351                tf   Divyateja Pasupuleti       0
171352         wait what   Divyateja Pasupuleti       0
171353      seriously ah   Divyateja Pasupuleti       0

[171354 rows x 3 columns]
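
The `-1` in the `person` column comes from how `pd.factorize` treats missing values. A minimal sketch on a made-up sender column (the short names below are placeholders, not the real data):

```python
import numpy as np
import pandas as pd

# Made-up sender column with one missing value
senders = pd.Series(["Divya", "Santrupti", np.nan, "Divya"])
codes, uniques = pd.factorize(senders)
print(codes)          # integer code per row; the missing sender gets -1
print(list(uniques))  # distinct senders in order of first appearance
```

Codes are assigned in order of first appearance, which is why the first named sender in the chat becomes `0`.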


In [6]:
# assign() returns a new DataFrame instead of writing into a slice of `messages`;
# split() with no arguments tokenises each message on whitespace
data = data.assign(text=data['text'].str.split())
print(data)

                         text                   from  person
0                          []                    NaN      -1
1                        [hi]   Divyateja Pasupuleti       0
2                [i'm, sorry]   Divyateja Pasupuleti       0
3       [once, upon, a, time]   Divyateja Pasupuleti       0
4                     [lolol]  Santruptiii BH Behera       1
...                       ...                    ...     ...
171349          [my, cg, bad]  Santruptiii BH Behera       1
171350    [cant, enjot, life]  Santruptiii BH Behera       1
171351                   [tf]   Divyateja Pasupuleti       0
171352           [wait, what]   Divyateja Pasupuleti       0
171353        [seriously, ah]   Divyateja Pasupuleti       0

[171354 rows x 3 columns]


In [7]:
word_counts_divya = {}
word_counts_santrupti = {}

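# .get(word, 1) + 1 stores (occurrences + 1) for every seen word, which acts as add-one smoothing
for text, _sender, person in data.values:
    try:
        # person == 1 is Santrupti; every other row, including person == -1 (no sender), is counted for Divya
        counts = word_counts_santrupti if person == 1 else word_counts_divya
        for word in text:
            counts[word] = counts.get(word, 1) + 1
    except TypeError:
        # text is not a list (e.g. NaN) for rows where lower()/split() could not be applied; skip them
        pass
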
print(word_counts_santrupti)
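
The same tally can be sketched with `collections.Counter`. This is an alternative formulation, not the notebook's code: it keeps raw counts and applies the add-one adjustment only when a count is read. The toy messages and the helper name `smoothed_count` are assumptions made for illustration.

```python
from collections import Counter

# Toy tokenised messages as (words, person) pairs; person 1 = Santrupti, 0 = Divya
toy_rows = [
    (["hi"], 0),
    (["wait", "what"], 0),
    (["lolol"], 1),
    (["my", "cg", "bad"], 1),
    (["wait"], 1),
]

raw_counts = {0: Counter(), 1: Counter()}
for words, person in toy_rows:
    raw_counts[person].update(words)  # add one per occurrence

def smoothed_count(person, word):
    """Hypothetical helper: raw count plus one, so unseen words count as 1."""
    return raw_counts[person][word] + 1

print(smoothed_count(0, "wait"), smoothed_count(1, "hi"))
```

Keeping raw and smoothed counts separate makes it easier to inspect the true word frequencies while producing the same values the dictionaries above store.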



In [8]:
# Each stored count is already (occurrences + 1), so these totals include the smoothing term
total_words_divya = sum(word_counts_divya.values())
total_words_santrupti = sum(word_counts_santrupti.values())

print(f"Total Divya Words: {total_words_divya}\nTotal santrupti Words: {total_words_santrupti}")

Total Divya Words: 312454
Total santrupti Words: 279226


In [9]:
# The priors here are each person's share of all counted words (not their share of messages)
total_words = total_words_divya + total_words_santrupti
prior_divya_probability = total_words_divya / total_words
prior_santrupti_probability = total_words_santrupti / total_words

print(f"Probability of Divya: {prior_divya_probability}\nProbability of Santrupti: {prior_santrupti_probability}")

Probability of Divya: 0.5280793672255273
Probability of Santrupti: 0.4719206327744727
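
The priors above are shares of words. Textbook Naive Bayes usually estimates P(person) as the share of messages instead. Here is a minimal sketch of that alternative on a made-up `person` column; the data is invented, and on the real chat this would give different numbers from those printed above.

```python
import pandas as pd

# Made-up sender codes: 0 = Divya, 1 = Santrupti, -1 = missing sender
person = pd.Series([0, 0, 1, 0, 1, -1])

# Drop rows without a sender, then take each person's share of the messages
known = person[person >= 0]
message_priors = known.value_counts(normalize=True)
print(message_priors)
```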


In [11]:
test_example_subject = "fuck off re"
santrupti_prob = prior_santrupti_probability
divya_prob = prior_divya_probability

# Multiply each prior by P(word | person) for every word; a word never seen for a person falls back to a count of 1
for word in test_example_subject.lower().split():
    print(f"Word: {word} | Divya Count: {word_counts_divya.get(word, 1)} | Santrupti Count: {word_counts_santrupti.get(word, 1)}")
    santrupti_prob *= word_counts_santrupti.get(word, 1) / total_words_santrupti
    divya_prob *= word_counts_divya.get(word, 1) / total_words_divya

print(f"Santrupti Probability: {santrupti_prob}")
print(f"Divya Probability: {divya_prob}")
if divya_prob > santrupti_prob:
    print("Divya")
else:
    print("Santrupti")

Word: fuck | Divya Count: 444 | Santrupti Count: 115
Word: off | Divya Count: 631 | Santrupti Count: 539
Word: re | Divya Count: 289 | Santrupti Count: 601
Santrupti Probability: 8.075370893030416e-10
Divya Probability: 1.4016871920490007e-09
Divya
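
Multiplying many small probabilities shrinks toward zero for long messages, so a common variant sums logarithms instead. The sketch below reuses the counts, totals and priors printed above. The function name `log_score` is an assumption, and this is not the notebook's own code.

```python
import math

# Numbers copied from the outputs above
totals = {"Divya": 312454, "Santrupti": 279226}
priors = {"Divya": 0.5280793672255273, "Santrupti": 0.4719206327744727}
counts = {
    "Divya": {"fuck": 444, "off": 631, "re": 289},
    "Santrupti": {"fuck": 115, "off": 539, "re": 601},
}

def log_score(person, words):
    """Hypothetical helper: log prior plus the sum of log P(word | person)."""
    score = math.log(priors[person])
    for word in words:
        # Unseen words fall back to a count of 1, as in the cell above
        score += math.log(counts[person].get(word, 1) / totals[person])
    return score

words = "fuck off re".lower().split()
scores = {person: log_score(person, words) for person in totals}
predicted = max(scores, key=scores.get)

# Normalise the two scores into a posterior probability for Divya
posterior_divya = 1 / (1 + math.exp(scores["Santrupti"] - scores["Divya"]))
print(predicted, round(posterior_divya, 2))
```

The ranking is the same as with the plain products, and the normalised posterior shows how close the decision is, which the raw unnormalised scores do not make obvious.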
