# Probability Theory and Random Variables
Andrew Zhang (azhang42)

# What is probability theory?

Probability theory is the branch of mathematics that deals with uncertain events and statements. For example, it can describe how likely it is that a die will land on 6 or that a coin will land on heads. Random variables are central to probability theory because they give us a way to formalize this uncertainty. One important concept for both probability and machine learning is the distribution of a random variable, which is also covered in statistics classes. A discrete distribution is described by a probability mass function, and a continuous distribution by a probability density function:

$$\text{For Discrete RVs}: \text{Probability Mass Function (PMF), } P(X = x)$$
$$\text{For Continuous RVs}: \text{Probability Density Function (PDF), } f(x)$$
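
To make the two cases concrete, here is a minimal sketch using only the Python standard library. The function names `die_pmf` and `normal_pdf` are illustrative choices, not part of any library. A PMF assigns a probability to each possible value and sums to 1, while a PDF is a density: probabilities come from integrating it over an interval, which is approximated here with a simple Riemann sum.

```python
import math
from fractions import Fraction

def die_pmf(x, sides=6):
    """PMF of a fair die: P(X = x) is 1/sides for each face, 0 otherwise."""
    return Fraction(1, sides) if x in range(1, sides + 1) else Fraction(0)

def normal_pdf(x, mu=0.0, sigma=1.0):
    """PDF of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# A PMF sums to 1 over all possible values.
total_mass = sum(die_pmf(x) for x in range(1, 7))

# A PDF integrates to 1; approximate the integral over [-8, 8] with a Riemann sum.
step = 0.001
total_area = sum(normal_pdf(-8 + i * step) * step for i in range(16000))

print(die_pmf(6))            # 1/6
print(total_mass)            # 1
print(round(total_area, 4))  # close to 1.0
```

Note that `normal_pdf(0)` is about 0.399, which is a density and not the probability that $X$ equals exactly 0; for a continuous random variable that probability is 0.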

## Bayes' Theorem

Another important concept in probability theory is Bayes' theorem, which is a way to calculate conditional probabilities. It starts from the definition of conditional probability:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

It is used in some machine learning applications, such as Naive Bayes classifiers. It is also the basis of Bayesian statistics, which updates our beliefs about an event as more data arrives.
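
As a small worked example of the conditional probability formula, the sketch below enumerates the outcomes of one fair die roll and computes $P(A|B)$, where $A$ is "the roll is 6" and $B$ is "the roll is even". The helper `prob` and the event functions are hypothetical names introduced only for this illustration.

```python
from fractions import Fraction

# Sample space: the six equally likely outcomes of one fair die roll.
outcomes = range(1, 7)

def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def is_six(o):   # event A
    return o == 6

def is_even(o):  # event B
    return o % 2 == 0

p_b = prob(is_even)
p_a_and_b = prob(lambda o: is_six(o) and is_even(o))

# P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)  # 1/3
```

Knowing that the roll is even raises the probability of a 6 from 1/6 to 1/3, because only three outcomes remain possible.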

## Naive Bayes

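Substituting $P(A \cap B) = P(B|A) \times P(A)$ into the definition of conditional probability gives Bayes' theorem in its usual form: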

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$
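
The sketch below applies this form to a diagnostic-test style problem. All numbers are hypothetical and chosen only for illustration, and `bayes_posterior` is a name made up for this example. It assumes $P(B)$ is not given directly and computes it with the law of total probability.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) from P(A), P(B|A), and P(B|not A)."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: A = "has the condition", B = "test is positive".
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(A|B) = {posterior:.3f}")  # P(A|B) = 0.161
```

Even with a fairly accurate test, the posterior stays low here because the prior $P(A)$ is small.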

Let's say you have some target class $C$ to predict from features $x_1, x_2, ..., x_n$. The aim is to choose the class with the highest posterior probability $P(C|x_1, x_2, ..., x_n)$.

If we assume that the features are conditionally independent given the class, we get the following expression:

$$P(C|x_1, x_2, ..., x_n) = \frac{P(x_1|C) \times P(x_2|C) \times ... \times P(x_n|C) \times P(C)}{P(x_1, x_2, ..., x_n)}$$

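The denominator does not depend on $C$, so for classification it can be dropped and the classes are compared using:

$$P(C|x_1, x_2, ..., x_n) \propto P(C) \times \prod_{i=1}^{n} P(x_i|C)$$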
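
The product above can be sketched from scratch for categorical features. This is a minimal illustration, not the method used in the demonstration below: the toy data is hypothetical, the names `score` and `predict` are made up for the example, and add-one (Laplace) smoothing is an extra assumption added so that an unseen feature value does not force the whole product to zero.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (features, class). Not taken from the census dataset.
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
    (("overcast", "cool"), "yes"),
]

class_counts = Counter(label for _, label in train)

# feature_counts[(i, label)][value] = rows of that class where feature i == value
feature_counts = defaultdict(Counter)
for features, label in train:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

# Number of distinct values per feature position, used for add-one smoothing.
n_values = [len({f[i] for f, _ in train}) for i in range(2)]

def score(features, label):
    """Unnormalized posterior: P(C) * prod_i P(x_i | C), with add-one smoothing."""
    result = class_counts[label] / len(train)
    for i, value in enumerate(features):
        count = feature_counts[(i, label)][value]
        result *= (count + 1) / (class_counts[label] + n_values[i])
    return result

def predict(features):
    """Pick the class with the largest unnormalized posterior."""
    return max(class_counts, key=lambda label: score(features, label))

print(predict(("sunny", "hot")))   # no
print(predict(("rainy", "cool")))  # yes
```

The scores are not probabilities, since the denominator was dropped, but their ordering is the same as the ordering of the true posteriors, which is all that classification needs.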


## Demonstration

We will use census data from Kaggle to predict whether a person makes more than 50k a year with a Naive Bayes classifier.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the dataset
file_path = './adult.csv'
data = pd.read_csv(file_path)

column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 
                'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 
                'hours-per-week', 'native-country', 'income']
data.columns = column_names

# Converting categorical columns to numerical using get_dummies
categorical_columns = data.select_dtypes(include=['object']).columns
data_encoded = pd.get_dummies(data, columns=categorical_columns)

# Splitting the data into features and target variable
X = data_encoded.drop(['income_ <=50K', 'income_ >50K'], axis=1)
y = data_encoded['income_ >50K']  # Using 'income_ >50K' as the target variable

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initializing and training the Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Making predictions on the test set
y_pred = gnb.predict(X_test)

# Calculating the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')


Accuracy: 79.66%


