## Naive Bayes (Gaussian)

#### TL;DR

In [None]:
from sklearn.naive_bayes import GaussianNB # Naive Bayes (Gaussian) algorithm
GNB = GaussianNB().fit(X_train, y_train)

In [11]:
import os # File management 
import pandas as pd # Data 
from sklearn.model_selection import train_test_split # Splitting data
from sklearn.naive_bayes import GaussianNB # Naive Bayes (Gaussian) algorithm

# Load CSV file data (cleaned and preprocessed) as dataframe.
fp = os.path.join('', 'tweets_sentiment.csv') # .join(folder, file)
df = pd.read_csv(fp, sep='\t', encoding='utf-8')

# Prepare data.
y =  df.loc[ :, 'sentiment_class'] # label to predict
X =  df.loc[ :, ['retweets', 'likes', 'hashtags_number']] # features used to predict label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

# Fit data to Gaussian Naive Bayes.
GNB = GaussianNB().fit(X_train, y_train)

# Accuracy on the train and test sets.
accuracy_train = round(GNB.score(X_train, y_train), 2)
accuracy_test  = round(GNB.score(X_test,  y_test), 2)

print('Accuracy - train: {}\nAccuracy - test:  {}'.format(accuracy_train, accuracy_test))

Accuracy - train: 0.71
Accuracy - test:  0.7


#### Contents:
1. Gaussian Naive Bayes description
2. <b>Usage example</b>
3. <b>Template</b>
4. More

#### Libraries used:

In [1]:
import os # File management 
import pandas as pd # Data tables
import matplotlib.pyplot as plt # Plots
import seaborn as sns           # Plots but nicer
from sklearn.model_selection import train_test_split # Splitting data (scikit-learn)
from sklearn.naive_bayes import GaussianNB # Gaussian Naive Bayes algorithm



### Gaussian Naive Bayes

#### 1. Gaussian Naive Bayes algorithm description
----

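Naive Bayes (Gaussian) is a machine learning classification algorithm that:
- Predicts quickly (suitable for real-time use) and supports multi-class problems.
- Assumes each numeric feature is normally (Gaussian) distributed within each class, so it suits continuous features; text and categorical data are usually handled with other Naive Bayes variants (multinomial, Bernoulli).
- Can be trained on a small data set.
- (To be updated.)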
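
The idea behind the algorithm can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: for each class it estimates a prior plus a per-feature mean and variance, then picks the class with the highest log-posterior under the "naive" assumption that features are independent given the class. The function names (`fit_gaussian_nb`, `predict_gaussian_nb`), the toy data and the variance floor `eps` are assumptions made for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate prior, per-feature mean and variance for every class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # eps keeps every variance strictly positive (assumed smoothing value).
        variances = [
            sum((v - m) ** 2 for v in col) / n + eps
            for col, m in zip(columns, means)
        ]
        model[label] = {"prior": n / len(y), "mean": means, "var": variances}
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one sample."""
    def log_posterior(params):
        score = math.log(params["prior"])
        for v, m, var in zip(row, params["mean"], params["var"]):
            # Log of the Gaussian density of v under this class.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score

    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical toy data: two numeric features, two classes.
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y_toy = ["Negative", "Negative", "Negative", "Positive", "Positive", "Positive"]

toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(toy_model, [1.1, 2.1]))
print(predict_gaussian_nb(toy_model, [5.1, 5.9]))
```

Working with log-probabilities instead of multiplying raw densities avoids numerical underflow when there are many features.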

#### 2. Usage example
------


#### Example problem

Predict tweet sentiment (as scored by the TextBlob model) based on its number of hashtags, retweet count and like count.


#### Example data

Data on 3800 tweets containing psychology + AI phrases (and similar), loaded from the tweets_sentiment.csv file.

Features:
- <b>tweet</b>           - tweet text.
- <b>hashtags</b>        - #hashtags in a tweet.
- <b>hashtags_number</b> - number of hashtags.
- <b>likes</b>           - number of tweet likes.
- <b>retweets</b>        - number of times the tweet has been shared.
- <b>sentiment</b>       - score in the range -1.0 to 1.0.
- <b>sentiment_class</b> - score simplified to Positive (> 0) or Negative (< 0).
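
The mapping from `sentiment` to `sentiment_class` can be sketched as below. The function name `to_sentiment_class` is hypothetical, and since the description above does not say what happens to a score of exactly 0, the sketch returns `None` for it; that choice is an assumption, not a documented property of the dataset.

```python
def to_sentiment_class(score):
    """Map a polarity score in [-1.0, 1.0] to a class label."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return None  # Assumption: a score of exactly 0 gets no class.


# Example scores, chosen for illustration.
for score in (-0.15, -0.165909, 1.0):
    print(score, "->", to_sentiment_class(score))
```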

Source: scraping the Twitter search API.

In [3]:
# Load data
df = pd.read_csv('tweets_sentiment.csv', sep='\t', encoding='utf-8')

# Inspect data - YOU CAN PLAY WITH THIS
show_rows = 3
show_row_first = 127
show_row_last = show_row_first + show_rows

# Display
print('Samples (rows): {:10}\nFeatures (columns): {:6}\n'.format(df.shape[0],df.shape[1]))
print('{} example samples (of {}) starting from {}-th:'.format(show_rows, len(df), show_row_first))
df.iloc[show_row_first : show_row_last,:].head(15)

Samples (rows):       3800
Features (columns):      7

3 example samples (of 3800) starting from 127-th:


| | tweet | hashtags | hashtags_number | likes | retweets | sentiment | sentiment_class |
|---|---|---|---|---|---|---|---|
| 127 | machine ml rt richsimmondsza rt evankirstel wa... | AI human | 2 | 0 | 6 | -0.15 | Negative |
| 128 | human vs ai new job base the ability compassio... | Human AI- jobs compassionate, empathetic, emot... | 16 | 0 | 3 | -0.165909 | Negative |
| 129 | cenotechs impressive ipfconline thank share we... | ArtificialIntelligence AI MachineLearning ML … | 5 | 0 | 2 | 1.0 | Positive |


#### Usage example (concise)

In [4]:
import os # File management 
import pandas as pd # Data tables
from sklearn.model_selection import train_test_split # Splitting data
from sklearn.naive_bayes import GaussianNB # Naive Bayes (Gaussian) algorithm

# Load CSV file data (cleaned and preprocessed) as dataframe.
fp = os.path.join('', 'tweets_sentiment.csv') # .join(folder, file)
df = pd.read_csv(fp, sep='\t', encoding='utf-8')

# Prepare data.
y =  df.loc[ :, 'sentiment_class'] # label to predict
X =  df.loc[ :, ['retweets', 'likes', 'hashtags_number']] # features used to predict label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

# Fit data to Gaussian Naive Bayes.
GNB = GaussianNB().fit(X_train, y_train)

# Accuracy on the train and test sets.
accuracy_train = round(GNB.score(X_train, y_train), 2)
accuracy_test  = round(GNB.score(X_test,  y_test), 2)

print('Accuracy - train: {}\nAccuracy - test:  {}'.format(accuracy_train, accuracy_test))

Accuracy - train: 0.71
Accuracy - test:  0.7


#### Usage example (explained)

In [6]:
# Specify data location.
DIR  = ''
file_name = 'tweets_sentiment.csv'

# Join folder and file name as a file path.
file_path  = os.path.join(DIR, file_name)

# Load CSV file data (cleaned and preprocessed) as dataframe.
df = pd.read_csv(file_path, sep='\t', encoding='utf-8')

# Prepare data 1: Split data into features and labels.
features = ['retweets', 'likes', 'hashtags_number']
label    = 'sentiment_class'

X =  df.loc[ :, features]
y =  df.loc[ :, label]


# Prepare data 2: Split data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split( 
                                                     X,                # Selected features data
                                                     y,                # Selected label data
                                                     random_state = 0  # Fixed seed for a reproducible split
                                                    )

# Load algorithm - Naive Bayes (Gaussian).  
GNB = GaussianNB()

# Fit Gaussian Naive Bayes to the training data.
GNB = GNB.fit( 
                       X_train,    # Features; columns the model uses to estimate per-class statistics.
                       y_train,    # Column with the values you want to predict.
                      )

# Get metric results - Accuracy.
accuracy_train = GNB.score(X_train, y_train)
accuracy_test  = GNB.score(X_test,  y_test)

# Display metric results
print('Accuracy - train set:  {:.2f}'.format(accuracy_train))
print('Accuracy -  test set:  {:.2f}'.format(accuracy_test))

Accuracy - train set:  0.71
Accuracy -  test set:  0.70
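
An accuracy of about 0.70 is easier to judge next to a trivial baseline that always predicts the most frequent training class. The sketch below computes that baseline in plain Python. The helper name `majority_baseline_accuracy` and the toy labels are assumptions for illustration; with the data from this notebook one would pass `y_train` and `y_test` instead.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    test_labels = list(test_labels)
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Hypothetical toy labels, not taken from tweets_sentiment.csv.
toy_train = ["Positive"] * 7 + ["Negative"] * 3
toy_test = ["Positive"] * 3 + ["Negative"] * 2

label, baseline = majority_baseline_accuracy(toy_train, toy_test)
print("Always predicting {!r} gives accuracy {:.2f}".format(label, baseline))
```

If the model's test accuracy is not clearly above this number, the chosen features likely carry little information about the label.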


#### 3. Template 

#### Template (concise)

In [10]:
import os # File management 
import pandas as pd # Data tables
from sklearn.model_selection import train_test_split # Splitting data
from sklearn.naive_bayes import GaussianNB  # Naive Bayes (Gaussian) algorithm

# YOU SET THIS:
fp = os.path.join('', 'your_file_name.csv') # DIR + file_name
label = 'column_name' # You want to predict this column's value...
features = ['column_name', 'column_name'] # ...based on the data in these columns.

# Load and prepare data.
df = pd.read_csv(fp, sep='\t', encoding='utf-8')
X =  df.loc[ :, features]
y =  df.loc[ :, label]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

# Load and fit data to Gaussian Naive Bayes algorithm.
GNB = GaussianNB().fit(X_train, y_train)

# Accuracy on the train and test sets.
accuracy_train = round(GNB.score(X_train, y_train), 2)
accuracy_test  = round(GNB.score(X_test,  y_test), 2)

print('Accuracy - train: {}\nAccuracy - test:  {}'.format(accuracy_train, accuracy_test))

Accuracy - train: 0.71
Accuracy - test:  0.7


#### Template (explained)

In [7]:
import os # File management
import pandas as pd # Data tables
from sklearn.model_selection import train_test_split # Splitting data
from sklearn.naive_bayes import GaussianNB # Naive Bayes (Gaussian) algorithm


# YOU FILL THIS:
# -------------------

# File with data:
file_name = ''      # Name of a file with data, e.g. 'tweets_sentiment.csv'
DIR  = ''           # File directory OR '' if file is in program directory.

# Data pick:
label    = ''       # What to predict (column)? E.g. 'column_name'
features = []       # Based on what (columns)? E.g. ['column_name', 'column_name']


# Load and prepare data (cleaned and preprocessed).
# -------------------------------------------------

# Join folder and file name as a file path.
file_path  = os.path.join(DIR, file_name) 

# Load CSV file as a dataframe.
df = pd.read_csv(file_path, sep='\t', encoding='utf-8')

# Split data into features and labels.
X =  df.loc[ :, features] # Selected features data
y =  df.loc[ :, label]    # Selected label data

# Split data into train and test set.
X_train, X_test, y_train, y_test = train_test_split(X,                # Selected features data
                                                    y,                # Selected label data
                                                    random_state = 0) # Fixed seed for a reproducible split

# Load algorithm - Naive Bayes (Gaussian).
GNB = GaussianNB()

# Fit the data into Gaussian Naive Bayes.
GNB = GNB.fit(X_train, y_train)

# Get metric results - Accuracy.
accuracy_train = GNB.score(X_train, y_train)
accuracy_test  = GNB.score(X_test,  y_test)

# Display metric results
print('Accuracy on training set: {:.2f}'.format(accuracy_train))
print('Accuracy on testing set:  {:.2f}'.format(accuracy_test))

Accuracy on training set: 0.71
Accuracy on testing set:  0.70


#### 4. More

To be updated.

By Luke, 10 II 2019.