# Machine Learning

*Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without explicit instructions. Recently, artificial neural networks have been able to surpass many previous approaches in performance.*

<center><img src="https://media.licdn.com/dms/image/D5612AQFQdYob-XRpkA/article-cover_image-shrink_720_1280/0/1710750874997?e=2147483647&v=beta&t=j9Ldybg7aZVUnpu4GuhFGYrtuP9hA-LRgMuqIsmt1bc" width="500px"/></center>

## Machine Learning Algorithms

**Machine learning algorithms are commonly divided into two main categories: `supervised` and `unsupervised`.**

**Supervised:** *We give the algorithm inputs together with the correct label for each one. Later, when it receives a new input, it predicts which label that input should have.*
1. Linear Regression
2. Logistic Regression
3. Multiple Regression
4. K Nearest Neighbors
5. Support Vector Machines
6. Decision Tree
7. Naive Bayes
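
The supervised idea (labeled examples in, a predicted label out) can be sketched with a tiny K Nearest Neighbors classifier written in NumPy. This is a minimal illustration, not a production implementation: the function name `knn_predict` and the toy data are made up for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training rows
    nearest = np.argsort(distances)[:k]
    # Count how often each label appears among those neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy labeled data (invented for illustration): two features per sample
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(["small", "small", "small", "large", "large", "large"])

prediction_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]))
prediction_b = knn_predict(X_train, y_train, np.array([5.1, 5.0]))
print(prediction_a, prediction_b)  # small large
```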

**Unsupervised:** *In this type of algorithm, we no longer provide labels. The algorithm finds structure on its own: it decides which samples are similar and puts them in the same group or cluster.*
1. High-Dimensional Clustering
2. Hierarchical Clustering
3. Dimensionality Reduction
4. K-means Clustering
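
Grouping without labels can be sketched with a bare-bones K-means in NumPy. The function `kmeans`, the toy data, and the simplistic initialization (the first `k` points become the starting centroids, which keeps the run deterministic) are all assumptions made for this example; real libraries use smarter initialization.

```python
import numpy as np

def kmeans(X, k, n_iter=10):
    """Plain K-means: alternate between assigning points and updating centroids."""
    centroids = X[:k].copy()  # simplistic, deterministic initialization (assumption)
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_samples, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)  # index of the closest centroid per point
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Unlabeled toy data (invented): two visibly separate groups
X = np.array([[1.0, 1.0], [5.0, 5.0], [1.2, 0.8],
              [5.2, 4.9], [0.9, 1.1], [4.8, 5.1]])

labels, centroids = kmeans(X, k=2)
print(labels)     # [0 1 0 1 0 1]
print(centroids)
```

Note that no labels were passed in: the cluster numbers `0` and `1` are arbitrary names the algorithm assigns to the groups it finds.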

**POINT:** There are 2 types of file path for reading and writing data: `Absolute` and `Relative`.

`pd.read_csv('path/file_name.format')`
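
The difference between the two path types can be sketched with the standard-library `pathlib`. The `data` folder below is a hypothetical location chosen for illustration; the file does not need to exist for this snippet to run.

```python
from pathlib import Path

# Relative path: interpreted from the current working directory
relative_path = Path("data") / "auto-mpg.csv"   # "data" is a hypothetical folder

# Absolute path: the same location, spelled out from the filesystem root
absolute_path = Path.cwd() / relative_path

print(relative_path.is_absolute())  # False
print(absolute_path.is_absolute())  # True
```

Either form can be passed to `pd.read_csv`; a relative path only works if the script is run from the directory it was written for.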

**If you use `skiprows` to skip a number of rows, note that the header row is skipped as well, so the first remaining row is treated as the header. To avoid this, pass the column names explicitly with `names`, as in the second command:**

`data=pd.read_csv('path', skiprows=5)`

`data=pd.read_csv('path/file_name.format', skiprows=5, names=['headers_name'])`
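
The effect of `skiprows` on the header can be seen on a small in-memory CSV. The CSV text and column names below are invented for the demonstration. The last call shows one possible alternative: passing a range of line indices so that line 0 (the header) is kept.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file
csv_text = "name,score\nana,1\nben,2\ncat,3\ndan,4\n"

# skiprows=2 drops the header line and the first data row,
# so the next row ("ben,2") is wrongly promoted to header
broken = pd.read_csv(io.StringIO(csv_text), skiprows=2)

# Supplying names= restores proper column labels
fixed = pd.read_csv(io.StringIO(csv_text), skiprows=2, names=["name", "score"])

# Alternative: skip only lines 1 and 2, keeping the original header (line 0)
kept = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 3))

print(list(broken.columns))  # ['ben', '2']
print(list(fixed.columns))   # ['name', 'score']
print(list(kept["name"]))    # ['cat', 'dan']
```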

**Take a look at some commands:**
1. `data.describe()` ---> shows summary statistics of the dataset (count, mean, std, min, quartiles, max).
2. `print(data.corr().to_string())` ---> prints the full correlation matrix without truncation.
3. `data.isnull()` ---> shows which values are missing.
4. `data.plot.density()` ---> draws the density plot, which is important because it shows the shape of the distribution.
5. `peak=density.max()` ---> gives the peak of the density curve (assuming `density` holds the computed density values).
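
A few of these commands can be tried on a tiny DataFrame. The data below is invented for illustration, and the density plot is left out because it needs extra plotting dependencies.

```python
import numpy as np
import pandas as pd

# Small made-up dataset with one missing value in column "y"
data = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, np.nan, 8.0],
})

summary = data.describe()                 # per-column count, mean, std, min, quartiles, max
missing_per_column = data.isnull().sum()  # number of missing values in each column
correlations = data.corr()                # pairwise correlation, ignoring missing pairs

print(summary)
print(missing_per_column)
print(correlations.to_string())
```

Chaining `.sum()` after `isnull()` is often more practical than the raw boolean table, since it reduces the result to one count per column.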

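**2 POINTS:**
1. *One of the most useful plots for a `pandas` dataset is the `heatmap` of its correlations. It is usually drawn with `seaborn` (`sns.heatmap`), since `pandas` itself has no built-in heatmap plot.*
2. *Regression is usually the first Machine Learning algorithm to learn. It finds the line (or curve) that fits the data most closely and is done in 2 modes, `Linear` and `Non-Linear`.*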

**Types of Error:**

1. **R² (R Squared):** *In statistics, the coefficient of determination, denoted R² or r² and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s). It equals one minus the Relative Squared Error (RSE).*
$$ R^2 = 1 - RSE $$

2. **MAE(Mean Absolute Error):** *In statistics, mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement. MAE is calculated as the sum of absolute errors divided by the sample size.*
<center><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/3ef87b78a9af65e308cf4aa9acf6f203efbdeded" width="290px"/></center>

3. **MSE(Mean Squared Error):** *In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.*
<center><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/92ea807c3147d94e8762772be5d12511f1d55938" width="190px"/></center>

4. **RMSE(Root Mean Squared Error):** *The root mean square deviation (RMSD) or root mean square error (RMSE) is either one of two closely related and frequently used measures of the differences between true or predicted values on the one hand and observed values or an estimator on the other.*
<center><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/6d689379d70cd119e3a9ed3c8ae306cafa5d516d" width="295px"/></center>

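5. **RSE(Relative Squared Error):** *Relative Squared Error (RSE) is the total squared error of the model's predictions divided by the total squared error of a simple baseline that always predicts the mean of the actual values. A value below 1 means the model does better than that baseline.*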
<center><img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSv5WGNl3UsVkBMBASuMC5xhaj2egSVXOkrs_VzM6iMn3cxwwCAU6noSZtq2F5oPBmGvw&usqp=CAU" width="185px"/></center>

6. **RAE(Relative Absolute Error):** *Relative Absolute Error (RAE) is a way to measure the performance of a predictive model. It's primarily used in machine learning, data mining, and operations management. RAE is not to be confused with relative error, which is a general measure of precision or accuracy for instruments like clocks, rulers, or scales.*
<center><img src="https://editor.analyticsvidhya.com/uploads/42009RAE.png" width="200px"/></center>
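
The six measures above can be computed by hand with NumPy. This is a sketch on made-up numbers; `y_true` and `y_pred` are illustrative names, and the relative measures use the mean of the observed values as the baseline.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])    # made-up observed values
y_pred = np.array([2.5, 5.5, 7.0, 10.0])   # made-up predictions

errors = y_true - y_pred            # prediction errors
baseline = y_true - y_true.mean()   # errors of always predicting the mean

mae = np.mean(np.abs(errors))                             # Mean Absolute Error
mse = np.mean(errors ** 2)                                # Mean Squared Error
rmse = np.sqrt(mse)                                       # Root Mean Squared Error
rse = np.sum(errors ** 2) / np.sum(baseline ** 2)         # Relative Squared Error
rae = np.sum(np.abs(errors)) / np.sum(np.abs(baseline))   # Relative Absolute Error
r2 = 1 - rse                                              # coefficient of determination

print(mae, mse, rmse)   # 0.5 0.375 0.6123...
print(rse, rae, r2)     # 0.075 0.25 0.925
```

The `metrics` module of `sklearn` provides ready-made functions for several of these (for example `mean_absolute_error`, `mean_squared_error`, and `r2_score`), so the manual version is mainly useful for seeing how the formulas relate.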

**Types of regression:**

1. **Simple Linear Regression:** *Only one independent variable `x` is used to predict `y`.*
2. **Multiple Linear Regression:** *Several independent variables (`x`s) are used together to predict `y`.*
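
Both types can be fitted with the same least-squares routine; the only difference is how many `x` columns are passed in. The helper `fit_linear` and the toy data are assumptions made for this sketch. The targets are generated from known coefficients so the fit can be checked.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y = intercept + X @ coefs; returns (intercept, coefs)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:              # a single x: turn it into one column
        X = X[:, None]
    # Prepend a column of ones so the intercept is estimated too
    A = np.column_stack([np.ones(len(X)), X])
    solution, *_ = np.linalg.lstsq(A, y, rcond=None)
    return solution[0], solution[1:]

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])

# Simple linear regression: one x, data generated as y = 1 + 2*x1
y_simple = 1 + 2 * x1
intercept_s, coefs_s = fit_linear(x1, y_simple)

# Multiple linear regression: two x's, data generated as y = 1 + 2*x1 + 3*x2
y_multi = 1 + 2 * x1 + 3 * x2
intercept_m, coefs_m = fit_linear(np.column_stack([x1, x2]), y_multi)

print(intercept_s, coefs_s)   # approximately 1.0 and [2.0]
print(intercept_m, coefs_m)   # approximately 1.0 and [2.0, 3.0]
```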

## Handling Missing Values

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
import matplotlib.pyplot as plt

In [2]:
# Read Dataset
df = pd.read_csv('auto-mpg.csv')

In [3]:
# Show first 5 rows
df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |


In [4]:
# Checking types to find missing values
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight            int64
acceleration    float64
model year        int64
origin            int64
car name         object
dtype: object

In [5]:
# Find the rows where horsepower is missing (recorded as '?') so they can be replaced with the mean or median
horsepower_nulls = np.nonzero(~df.horsepower.str.isdigit())[0]
data = df.replace('?', np.nan)
data.iloc[horsepower_nulls, :]

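| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | NaN | 2046 | 19.0 | 71 | 1 | ford pinto |
| 126 | 21.0 | 6 | 200.0 | NaN | 2875 | 17.0 | 74 | 1 | ford maverick |
| 330 | 40.9 | 4 | 85.0 | NaN | 1835 | 17.3 | 80 | 2 | renault lecar deluxe |
| 336 | 23.6 | 4 | 140.0 | NaN | 2905 | 14.3 | 80 | 1 | ford mustang cobra |
| 354 | 34.5 | 4 | 100.0 | NaN | 2320 | 15.8 | 81 | 2 | renault 18i |
| 374 | 23.0 | 4 | 151.0 | NaN | 3035 | 20.5 | 82 | 1 | amc concord dl |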


In [6]:
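# Convert the column to float first so it ends up numeric, then fill the gaps with the mean
data['horsepower'] = data['horsepower'].astype('float64')
data['horsepower'] = data['horsepower'].fillna(data['horsepower'].mean())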
data.iloc[horsepower_nulls, :]

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | 104.469388 | 2046 | 19.0 | 71 | 1 | ford pinto |
| 126 | 21.0 | 6 | 200.0 | 104.469388 | 2875 | 17.0 | 74 | 1 | ford maverick |
| 330 | 40.9 | 4 | 85.0 | 104.469388 | 1835 | 17.3 | 80 | 2 | renault lecar deluxe |
| 336 | 23.6 | 4 | 140.0 | 104.469388 | 2905 | 14.3 | 80 | 1 | ford mustang cobra |
| 354 | 34.5 | 4 | 100.0 | 104.469388 | 2320 | 15.8 | 81 | 2 | renault 18i |
| 374 | 23.0 | 4 | 151.0 | 104.469388 | 3035 | 20.5 | 82 | 1 | amc concord dl |
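
The cells above fill the gaps with the mean. The median alternative mentioned in the comment could look like the sketch below, which uses a small invented stand-in for the `horsepower` column instead of the real file. `pd.to_numeric` with `errors="coerce"` is used here as one possible way to turn the `'?'` markers into `NaN` and convert the column in a single step.

```python
import pandas as pd

# Small made-up stand-in for the horsepower column, with '?' marking missing entries
sample = pd.DataFrame({"horsepower": ["130", "165", "?", "150", "?", "95"]})

# errors="coerce" turns anything non-numeric (here '?') into NaN
hp = pd.to_numeric(sample["horsepower"], errors="coerce")
n_missing = int(hp.isna().sum())   # how many values need replacing

# The median is less sensitive to extreme values than the mean
sample["horsepower"] = hp.fillna(hp.median())

print(n_missing)                       # 2
print(sample["horsepower"].tolist())   # [130.0, 165.0, 140.0, 150.0, 140.0, 95.0]
```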
