# Naïve Bayes

* [NB in Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Relation_to_logistic_regression)


* Naive Bayes is used for classification. Its inputs can be strings or numbers (treated categorically), and its output is a discrete class label such as 1 or 0, never an in-between value like 0.5 as in regression.

* [Technical Note: Naive Bayes for Regression](https://link.springer.com/content/pdf/10.1023%2FA%3A1007670802811.pdf) shows that NB is not well suited to regression tasks.


* Relation to logistic regression: naive Bayes classifier can be considered a way of fitting a probability model that optimizes the joint likelihood p(C , x), while logistic regression fits the same probability model to optimize the conditional p(C | x).
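The two objectives can be written side by side. This is a sketch in notation assumed here rather than taken from the notebook: $C$ is the class, $x = (x_1, \dots, x_n)$ the feature vector, $\theta$ the model parameters, and $i$ indexes the training examples.

```latex
\begin{aligned}
\text{Naive Bayes assumption:}\quad
  & p(C, x) = p(C)\prod_{j=1}^{n} p(x_j \mid C) \\
\text{Prediction:}\quad
  & \hat{C} = \arg\max_{C}\; p(C)\prod_{j=1}^{n} p(x_j \mid C) \\
\text{Naive Bayes fit (joint likelihood):}\quad
  & \hat{\theta}_{\mathrm{NB}} = \arg\max_{\theta} \sum_{i} \log p_{\theta}\bigl(C^{(i)}, x^{(i)}\bigr) \\
\text{Logistic regression fit (conditional likelihood):}\quad
  & \hat{\theta}_{\mathrm{LR}} = \arg\max_{\theta} \sum_{i} \log p_{\theta}\bigl(C^{(i)} \mid x^{(i)}\bigr)
\end{aligned}
```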





## Advantages
* It is a simple approach that is also fast and often accurate for prediction.
* Naive Bayes has a very low computational cost.
* It works efficiently on large datasets.
* It performs well when the response variable is discrete (classification) rather than continuous (regression).
* It can be used for multi-class prediction problems.
* It also performs well on text analytics problems.
* When the independence assumption holds, a Naive Bayes classifier can perform better than other models such as logistic regression.

## Disadvantages
* The assumption of independent features. In practice, it is almost impossible to obtain a set of predictors that are entirely independent.
* If a categorical feature value never occurs together with a particular class in the training data, its estimated conditional probability is zero, which forces the posterior probability of that class to zero regardless of the other features. This is known as the zero-probability (zero-frequency) problem, and it is usually handled with a smoothing technique such as Laplace (add-one) smoothing.
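The zero-frequency problem and its usual remedy can be illustrated with a few lines of plain Python. This is a minimal sketch on made-up data; the `samples` list and the `conditional_prob` helper are hypothetical and are not part of the notebook.

```python
from collections import Counter

# Hypothetical toy data: (weather, label) pairs. "rainy" never occurs with "yes".
samples = [("sunny", "yes"), ("sunny", "yes"), ("cloudy", "yes"),
           ("rainy", "no"), ("sunny", "no")]


def conditional_prob(value, label, alpha=0.0):
    """Estimate P(feature == value | class == label) with add-alpha smoothing."""
    values = sorted({v for v, _ in samples})          # distinct feature values
    in_class = [v for v, y in samples if y == label]  # values seen with this class
    count = Counter(in_class)[value]
    return (count + alpha) / (len(in_class) + alpha * len(values))


unsmoothed = conditional_prob("rainy", "yes")           # 0 / 3
smoothed = conditional_prob("rainy", "yes", alpha=1.0)  # (0 + 1) / (3 + 3)
print(unsmoothed, smoothed)
```

Without smoothing the estimate is exactly zero, so any product of probabilities that includes it is zero as well. With `alpha=1.0` every feature value keeps a small non-zero probability.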

In [100]:
!gdown --id 1t3gVQVCAn19xSa-CzTzxLHRq3XeeoFlU

Downloading...
From: https://drive.google.com/uc?id=1t3gVQVCAn19xSa-CzTzxLHRq3XeeoFlU
To: /content/titanic.csv
100% 61.2k/61.2k [00:00<00:00, 20.6MB/s]


In [101]:
import pandas as pd


In [102]:
df = pd.read_csv("titanic.csv")

In [103]:
df.head(1)

Unnamed: 0,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0


The next step calls the `groupby()` and `describe()` functions on the DataFrame `df`. Here is what the expression does and how it is structured.

```python
df.groupby("Sex").describe()
```

### Code Explanation:

1. `df`: This is the DataFrame object that contains the data you want to analyze. It likely has multiple columns, including a column named "Sex" that is used for grouping.

2. `groupby("Sex")`: The `groupby()` function is used to group the rows of the DataFrame based on the unique values in the "Sex" column. This means that all rows with the same value in the "Sex" column will be grouped together.

3. `describe()`: The `describe()` function is applied to each group created by the `groupby()` operation. It computes various summary statistics for each group. The result is a new DataFrame that contains these summary statistics.

4. The code doesn't store the result of `df.groupby("Sex").describe()` into a variable, so the result will be displayed in the output, assuming the code is being executed in an interactive environment or a Jupyter Notebook.

### Functionality and Features:

The code provides a summary analysis of the DataFrame `df` grouped by the "Sex" column. The resulting summary statistics include count, mean, standard deviation, minimum value, 25th percentile, median (50th percentile), 75th percentile, and maximum value for each numerical column in the original DataFrame. These statistics give insights into the distribution and central tendency of the data, grouped by the unique values in the "Sex" column.

### Example and Use Cases:

Let's consider an example to illustrate the use of this code. Assume we have a DataFrame `df` with the following structure:

```
   Name    Sex  Age  Height
0  John    Male   25     180
1  Jane  Female   30     165
2  Alex    Male   28     175
3  Mary  Female   32     160
4  Mark    Male   27     185
```

If we apply the code `df.groupby("Sex").describe()`, it will group the rows based on the unique values in the "Sex" column, which are "Male" and "Female". Then, it will compute the summary statistics for each group:

```
         Age
       count       mean       std   min   25%   50%   75%   max
Sex
Female   2.0  31.000000  1.414214  30.0  30.5  31.0  31.5  32.0
Male     3.0  26.666667  1.527525  25.0  26.0  27.0  27.5  28.0
```

Only the "Age" block of the output is shown here. The "Female" group has 2 entries, with an average age of 31 and a standard deviation of 1.41. The "Male" group has 3 entries, with an average age of 26.7 and a standard deviation of 1.53. The same eight statistics are computed for the "Height" column as well.

This code is particularly useful when you want to compare summary statistics across different groups in your dataset. It helps in identifying patterns, differences, or similarities in the data based on the grouping variable. In the example above, it provides a concise summary of the ages and heights for males and females in the dataset.
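The illustrative table can be reproduced directly. The sketch below builds the toy frame by hand (the name `toy` is an assumption, chosen so it does not overwrite the Titanic `df`) and selects the numeric columns explicitly so the result does not depend on the pandas version.

```python
import pandas as pd

# Toy data mirroring the illustrative table above (not the Titanic dataset).
toy = pd.DataFrame({
    "Name": ["John", "Jane", "Alex", "Mary", "Mark"],
    "Sex": ["Male", "Female", "Male", "Female", "Male"],
    "Age": [25, 30, 28, 32, 27],
    "Height": [180, 165, 175, 160, 185],
})

# Summarize only the numeric columns, one row per value of "Sex".
summary = toy.groupby("Sex")[["Age", "Height"]].describe()

# The result has two-level columns: (original column, statistic).
female_mean_age = summary.loc["Female", ("Age", "mean")]
male_count = summary.loc["Male", ("Age", "count")]
print(summary["Age"])
```

Indexing with a tuple such as `("Age", "mean")` picks a single statistic out of the two-level column index, which is often more convenient than reading the wide printed table.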

In [104]:
df.groupby("Sex").describe()

,PassengerId,PassengerId,PassengerId,PassengerId,PassengerId,PassengerId,PassengerId,PassengerId,Pclass,Pclass,...,Fare,Fare,Survived,Survived,Survived,Survived,Survived,Survived,Survived,Survived
Sex,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
female,314.0,431.028662,256.846324,2.0,231.75,414.5,641.25,889.0,314.0,2.159236,...,55.0,512.3292,314.0,0.742038,0.438211,0.0,0.0,1.0,1.0,1.0
male,577.0,454.147314,257.486139,1.0,222.0,464.0,680.0,891.0,577.0,2.389948,...,26.55,512.3292,577.0,0.188908,0.391775,0.0,0.0,0.0,0.0,1.0


In [105]:
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
df.head(1)

Unnamed: 0,Pclass,Sex,Age,Fare,Survived
0,3,male,22.0,7.25,0


In [106]:
target = df.Survived
inputs = df.drop("Survived",axis="columns")
inputs.head()

Unnamed: 0,Pclass,Sex,Age,Fare
0,3,male,22.0,7.25
1,1,female,38.0,71.2833
2,3,female,26.0,7.925
3,1,female,35.0,53.1
4,3,male,35.0,8.05


In [107]:
dummies = pd.get_dummies(inputs.Sex)
dummies.head(3)

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0


The next cell uses pandas (imported as `pd`) to attach the dummy columns to the feature table. It consists of two lines:

```python
inputs = pd.concat([inputs, dummies], axis=1)
```

This line uses `pd.concat()` to join two data frames side by side, that is, along the column axis (`axis=1`). The two data frames are `inputs` and `dummies`, and the result is assigned back to the variable `inputs`.

`pd.concat()` combines data frames along a specified axis. With `axis=1`, rows are aligned on their index labels and the columns of `dummies` are appended to those of `inputs`. Because `dummies` was derived from `inputs`, both share the same index, so every row lines up.

```python
inputs.head(3)
```

This line uses `head()` to display the first three rows of `inputs`, which gives a quick check that the new columns were added.

The purpose of the code snippet as a whole is to concatenate the `dummies` data frame with the `inputs` data frame along the columns and then display the first three rows of the resulting data frame (`inputs`) using the `head()` function.

Use case/example:
Suppose we have two data frames: `inputs` and `dummies`. The `inputs` data frame contains information about various inputs, and the `dummies` data frame contains additional categorical variables represented as dummy variables. By concatenating these two data frames, we can combine the information from both sources into a single data frame for further analysis or modeling.
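A small self-contained sketch of the same pattern follows. The frames `features` and `indicator` are hypothetical stand-ins for `inputs` and `dummies`.

```python
import pandas as pd

# Hypothetical toy frames standing in for `inputs` and `dummies`.
features = pd.DataFrame({"Sex": ["male", "female", "female"],
                         "Age": [22.0, 38.0, 26.0]})
indicator = pd.get_dummies(features["Sex"])  # one indicator column per category

# axis=1 places the frames side by side, aligning rows on the index.
combined = pd.concat([features, indicator], axis=1)
print(combined.columns.tolist())
print(combined.shape)
```

In recent pandas versions the indicator columns are boolean rather than 0/1 integers; passing `dtype=int` to `get_dummies` yields integers if the output should match the tables shown in this notebook.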


In [108]:
inputs = pd.concat([inputs,dummies],axis=1)
inputs.head(3)

Unnamed: 0,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,0,1
1,1,female,38.0,71.2833,1,0
2,3,female,26.0,7.925,1,0


In [109]:
inputs.drop(['Sex', 'male'], axis=1, inplace=True)
inputs.head()

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,0
1,1,38.0,71.2833,1
2,3,26.0,7.925,1
3,1,35.0,53.1,1
4,3,35.0,8.05,0


The next cell calls `isna()` on `inputs` and then applies `any()` to the result. Since `inputs` is a pandas DataFrame, the expression works as follows:

1. `isna()` checks for missing values. It returns a boolean DataFrame with the same shape as the original, where each element is `True` if the corresponding element in the original is missing (`NaN` or `None`) and `False` otherwise.

2. Calling `isna()` on `inputs` therefore marks every missing entry in the feature table.

3. `any()` is then applied to the result of `isna()`. This is the pandas `DataFrame.any()` method, not Python's built-in `any()`. By default it reduces over the rows (`axis=0`) and reports, for each column, whether at least one value is `True`.

4. The final result is a boolean Series indexed by column name that shows which columns contain at least one missing value.

Example:

Let's assume we have a DataFrame `inputs` representing a dataset of students' exam scores:

```
   Name  Score1  Score2
0  John    80.0    90.0
1  Mary    75.0     NaN
2  Alex     NaN    85.0
3  Jane    90.0    92.0
```

If we apply the provided code to this DataFrame, the result would be:

```
Name      False
Score1     True
Score2     True
dtype: bool
```

This output indicates that there are missing or null values in the "Score1" and "Score2" columns, while the "Name" column does not have any missing values.

Use case:

The code snippet is useful for quickly checking whether a DataFrame or Series contains any missing or null values. It can be used as a data quality check or as a preliminary step before performing data cleaning or analysis. For example, if there are missing values, you can decide to remove or impute them before further analysis.
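The exam-score example can be run as follows. This is a sketch on the toy data only; the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the exam-score example above.
scores = pd.DataFrame({
    "Name": ["John", "Mary", "Alex", "Jane"],
    "Score1": [80.0, 75.0, np.nan, 90.0],
    "Score2": [90.0, np.nan, 85.0, 92.0],
})

per_column = scores.isna().any()         # boolean Series, one entry per column
missing_counts = scores.isna().sum()     # number of missing values per column
any_missing = scores.isna().any().any()  # single boolean for the whole frame
print(per_column)
```

Chaining a second `.any()` collapses the per-column Series into one boolean, which is convenient in an `if` statement or an assertion.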

In [110]:
inputs.isna().any()

Pclass    False
Age        True
Fare      False
female    False
dtype: bool

In [111]:
inputs.isna().sum()

Pclass      0
Age       177
Fare        0
female      0
dtype: int64

In [112]:
inputs.Age[:20]

0     22.0
1     38.0
2     26.0
3     35.0
4     35.0
5      NaN
6     54.0
7      2.0
8     27.0
9     14.0
10     4.0
11    58.0
12    20.0
13    39.0
14    14.0
15    55.0
16     2.0
17     NaN
18    31.0
19     NaN
Name: Age, dtype: float64

The next cell fills the gaps in the `Age` column of the `inputs` DataFrame:

```python
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()
```

Line 1:
- `mean()` is called on the `Age` column and computes the average of the non-missing entries (pandas skips `NaN` by default).
- `fillna()` replaces every missing value (`NaN`) in the `Age` column with that mean.
- The filled column is assigned back to `inputs.Age`.

Line 2:
- `head()` returns the first five rows of the DataFrame as a preview of the result.

In summary, the code replaces missing ages with the mean of the known ages and then displays the first rows. Note that the first missing age is at index 5, so the five-row preview does not include any of the filled values.

Example use case:
Suppose you have a dataset containing information about individuals, including their ages. However, some entries have missing values for the age. To address this issue, you can use the provided code to fill in the missing values with the average age of the known entries. This ensures that the dataset remains complete and allows for further analysis or processing that requires complete data. The `head()` method call in line 2 gives you a glimpse of the modified dataset to verify the changes.
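One caveat: the notebook computes the mean over all rows before the train/test split, so the test rows influence the imputed value. A stricter variant, sketched below on hypothetical data, computes the mean on the training portion only and reuses it for the test portion.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps, already split into training and test portions.
train_age = pd.Series([22.0, np.nan, 26.0, 36.0])
test_age = pd.Series([np.nan, 40.0])

train_mean = train_age.mean()               # NaN is skipped: (22 + 26 + 36) / 3
train_filled = train_age.fillna(train_mean)
test_filled = test_age.fillna(train_mean)   # reuse the training mean for test rows
print(train_mean)
```

On a dataset as small as Titanic the difference is usually minor, but keeping test data out of every preprocessing statistic gives a more honest accuracy estimate.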

In [113]:
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,0
1,1,38.0,71.2833,1
2,3,26.0,7.925,1
3,1,35.0,53.1,1
4,3,35.0,8.05,0


In [114]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(inputs, target, test_size=0.3)

In [115]:
from sklearn.naive_bayes import GaussianNB

In [116]:
model = GaussianNB()

The next cell calls `model.fit(X_train, Y_train)`, which trains the Gaussian Naive Bayes classifier created above.

In scikit-learn, every estimator exposes a `fit` method that learns the model's parameters from the input data (`X_train`) and the corresponding target values (`Y_train`).

Here is a breakdown of the call:

- `model`: The estimator instance. In this notebook it is a `GaussianNB` object, but other scikit-learn estimators (linear regression, support vector machines, and so on) are trained through the same method.

- `fit()`: The method that trains the model. It returns the fitted estimator itself.

- `X_train`: The feature data used for training. It is a two-dimensional array-like structure in which each row is a training sample and each column is a feature.

- `Y_train`: The target data corresponding to `X_train`. For classification it is a one-dimensional array-like structure that holds the true class label of each training sample.

Many models train iteratively: an optimizer such as gradient descent repeatedly adjusts the parameters to reduce a loss that measures the gap between predictions and targets. Gaussian Naive Bayes does not work this way. Its parameters have closed-form estimates, so `fit` only needs to compute the prior probability of each class and, for every class, the mean and variance of each feature.

As a result, there are no epochs, learning rate, or convergence criterion to configure. `GaussianNB.fit` accepts an optional `sample_weight` argument, and the estimator's main hyperparameter is `var_smoothing`, which adds a small value to the variances for numerical stability.

Example use case:
Let's say we have a dataset of housing prices, where the features are the number of bedrooms, the size of the house, and the location, and the target variable is the price. We want to train a regression model to predict house prices based on these features.

We can split the dataset into training and testing sets (`X_train`, `X_test`, `Y_train`, `Y_test`). Then, we can instantiate a regression model object (`model`) and call the `fit` method with the training data (`X_train`, `Y_train`):

```python
from sklearn.linear_model import LinearRegression

# Assuming X_train, X_test, Y_train, Y_test are already defined

model = LinearRegression()
model.fit(X_train, Y_train)
```

The `fit` method will train the linear regression model using the provided training data. After training, the model will have learned the relationships between the features and the target variable, allowing us to make predictions on new, unseen data.
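For the Gaussian Naive Bayes model used in this notebook, the learned parameters can be inspected after `fit`. The sketch below uses hypothetical toy data and compares the fitted `theta_` attribute with per-class means computed by hand.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: two features, two classes.
X = np.array([[1.0, 20.0], [2.0, 22.0], [3.0, 24.0],
              [8.0, 50.0], [9.0, 54.0]])
y = np.array([0, 0, 0, 1, 1])

clf = GaussianNB().fit(X, y)

# Per-class feature means computed by hand for comparison.
manual_means = np.vstack([X[y == c].mean(axis=0) for c in clf.classes_])

print(clf.class_prior_)  # fraction of training samples in each class
print(clf.theta_)        # per-class mean of each feature
```

The fitted means match the hand-computed ones, which illustrates that training here amounts to computing summary statistics per class rather than running an optimizer.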

In [117]:
model.fit(X_train,Y_train)

The next cell calls the `score` method on `model` with two arguments, `X_test` and `Y_test`. Here is what the call does:

1. `model` is the classifier trained in the previous step, here a `GaussianNB` instance. Every scikit-learn classifier and regressor provides the same method.

2. `score` evaluates the performance of the fitted model on a given dataset.

3. The `score` method takes two parameters:
   - `X_test`: The test features on which the model makes predictions. It is a matrix-like object of shape `(n_samples, n_features)`, where each row is a sample and each column is a feature.
   - `Y_test`: The true target values for the test data. It is a one-dimensional array-like object of length `n_samples`.

4. The metric is fixed by the kind of estimator. For scikit-learn classifiers, `score` returns the mean accuracy, that is, the fraction of samples whose predicted label matches the true label. For regressors, it returns the coefficient of determination (R-squared). Other metrics such as precision, recall, F1 score, mean squared error (MSE), and mean absolute error (MAE) are available separately in `sklearn.metrics`.

5. Internally, `score` predicts labels for `X_test`, compares them with `Y_test`, and returns the resulting metric.

6. The result is a single number. For a classifier it lies between 0 and 1, and higher is better. The value of about 0.79 shown below means that roughly 79% of the test passengers were classified correctly.

Use case example:
Suppose we have trained a machine learning model to predict whether an email is spam or not based on its content. We have a separate dataset called `X_test` that contains new email contents, and `Y_test` contains the true labels (spam or not spam) for these emails. By calling `model.score(X_test, Y_test)`, we can evaluate how accurately our model predicts the email labels compared to the true labels. The resulting score can help us assess the effectiveness of our model in distinguishing spam from non-spam emails.
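For a scikit-learn classifier, `score` is equivalent to computing accuracy on the predicted labels. The sketch below checks this on hypothetical toy data; all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical, well-separated toy data.
X_fit = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_fit = np.array([0, 0, 0, 1, 1, 1])
X_eval = np.array([[1.5], [2.5], [10.5], [11.5]])
y_eval = np.array([0, 1, 1, 1])  # the second label deliberately disagrees with the model

clf = GaussianNB().fit(X_fit, y_fit)

via_score = clf.score(X_eval, y_eval)
via_metric = accuracy_score(y_eval, clf.predict(X_eval))
print(via_score, via_metric)
```

Both values agree because three of the four evaluation samples are labeled the way the model predicts them.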

In [118]:
model.score(X_test,Y_test)

0.7947761194029851

In [119]:
model.predict(X_test[0:10])

array([0, 1, 0, 1, 1, 0, 1, 1, 0, 0])

The next cell calls the `predict_proba()` method on `model`, passing in a subset of the `X_test` dataset:

```python
model.predict_proba(X_test[:10])
```

The call relies on the `model` trained above and on the `X_test` split created earlier.

The `predict_proba()` method is commonly used in classification tasks with machine learning models. It returns the predicted probabilities for each class label based on the input data. The code is using this method to predict the probabilities for the first 10 samples in the `X_test` dataset.

Here's a step-by-step breakdown of what the code does:

1. `X_test[:10]` slices the `X_test` dataset to retrieve the first 10 samples. This assumes that `X_test` is a collection of input features for the test data.

2. The sliced subset of the `X_test` dataset, containing the first 10 samples, is passed as an argument to the `predict_proba()` method of the `model` object.

3. The `predict_proba()` method generates predicted probabilities for each class label for the given input data. The output will be an array or matrix where each row corresponds to a sample, and each column represents the probability of that sample belonging to a specific class.

4. The output of `model.predict_proba(X_test[:10])` will be the predicted probabilities for the first 10 samples in the `X_test` dataset.

In scikit-learn the result is an array of shape `(n_samples, n_classes)`. Its columns follow the order of `model.classes_`, and each row sums to 1. In this notebook the first column is the probability of class 0 (did not survive) and the second column is the probability of class 1 (survived).

Use cases for this code may include:

- Evaluating a classification model: By comparing the predicted probabilities to the true labels, you can assess how confident the model is in its predictions. This can be useful for evaluating the model's performance and identifying cases where the model may be uncertain.

- Decision making: If you have a trained model and want to make predictions on new, unseen data, you can use `predict_proba()` to get the predicted probabilities for each class. This can help you make decisions based on the confidence of the model's predictions.

- Threshold selection: If you want to classify new data based on certain criteria (e.g., only classify as positive if the predicted probability is above a certain threshold), `predict_proba()` can provide the necessary probabilities to apply the desired thresholds.
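The threshold-selection use case can be sketched as follows. The toy data and the 0.9 threshold are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data with one feature and two classes.
X_fit = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_fit = np.array([0, 0, 0, 1, 1, 1])
clf = GaussianNB().fit(X_fit, y_fit)

X_new = np.array([[2.0], [5.5], [8.0]])
proba = clf.predict_proba(X_new)  # shape (n_samples, n_classes)

# Column order follows clf.classes_, so look up the column for class 1 explicitly.
positive_col = list(clf.classes_).index(1)
threshold = 0.9  # assumed value, stricter than the default 0.5 cut-off
confident_positive = proba[:, positive_col] >= threshold

# predict() picks the class with the highest probability in each row.
labels_from_proba = clf.classes_[proba.argmax(axis=1)]
print(proba.round(3))
print(confident_positive)
```

Looking up the column through `classes_` avoids silently reading the wrong column when labels are not simply 0 and 1.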


In [120]:
model.predict_proba(X_test[:10])

array([[0.95794261, 0.04205739],
       [0.47289053, 0.52710947],
       [0.77372393, 0.22627607],
       [0.03472067, 0.96527933],
       [0.02140959, 0.97859041],
       [0.6875139 , 0.3124861 ],
       [0.07658162, 0.92341838],
       [0.41986965, 0.58013035],
       [0.77670712, 0.22329288],
       [0.96221334, 0.03778666]])

The final cell uses the `cross_val_score` function from the `sklearn.model_selection` module in scikit-learn. This function performs cross-validation on a machine learning model.

Here's a breakdown of the code:

1. The first line imports the `cross_val_score` function from the `sklearn.model_selection` module. This function is responsible for performing cross-validation.

2. The second line calls the `cross_val_score` function with the following arguments:
   - The first argument, `GaussianNB()`, represents an instance of the Gaussian Naive Bayes classifier. This is the machine learning model that will be evaluated using cross-validation.
   - The second argument, `X_train`, represents the input features of the training dataset. It is typically a 2-dimensional array or dataframe.
   - The third argument, `Y_train`, represents the target labels of the training dataset. It is usually a 1-dimensional array or series.
   - The `cv` parameter is set to 5, which specifies the number of cross-validation folds, so 5-fold cross-validation is performed. For a classifier, an integer `cv` produces stratified folds, which keep the class proportions roughly equal across folds.
   - The `scoring` parameter is set to "accuracy". This specifies the evaluation metric to be used during cross-validation. In this case, accuracy is used to measure the performance of the model.

3. The `cross_val_score` function returns an array of scores, where each score corresponds to the evaluation metric (accuracy in this case) for each fold of cross-validation.

4. The `.mean()` method is chained to the end of the `cross_val_score` function. It calculates the average of the scores obtained from cross-validation. This provides a single value that represents the overall performance of the model.

In summary, this code snippet uses the Gaussian Naive Bayes classifier to perform 5-fold cross-validation on a given training dataset (`X_train` and `Y_train`). It evaluates the model's accuracy using cross-validation and returns the average accuracy score.

Example:
```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assuming X_train and Y_train are already defined

# Perform 5-fold cross-validation on the Gaussian Naive Bayes classifier
accuracy_scores = cross_val_score(GaussianNB(), X_train, Y_train, cv=5, scoring="accuracy")

# Calculate the average accuracy score
mean_accuracy = accuracy_scores.mean()

print("Average Accuracy:", mean_accuracy)
```

Use case:
This code is useful when you want to assess the performance of a machine learning classifier using cross-validation. Cross-validation helps in obtaining a more reliable estimate of the model's performance by training and evaluating the model on different subsets of the data. The average accuracy score obtained can be used to compare different classifiers or tune hyperparameters of the chosen classifier.
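To make the mechanics concrete, the sketch below writes the same evaluation out as an explicit loop and compares it with `cross_val_score`. The synthetic data and variable names are assumptions; the loop relies on the fact that an integer `cv` with a classifier uses unshuffled `StratifiedKFold` splits.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic data: two Gaussian blobs, 50 samples per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

estimator = GaussianNB()
fold_scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")

# The same evaluation written out: a fresh copy of the model is fitted on each
# training fold and scored on the held-out fold.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(estimator).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

print(fold_scores.mean(), fold_scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the accuracy varies from fold to fold, which a single train/test split cannot show.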

In [121]:
from sklearn.model_selection import cross_val_score
cross_val_score(GaussianNB(),X_train,Y_train, cv=5, scoring="accuracy").mean()

0.7655870967741937