###Get Data

###What Is a Missing Value?

**How Is a Missing Value Represented in a Dataset?**
* In pandas, a missing value usually appears as `NaN` ("not a number").
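
As a small illustration of how missing values show up in practice, the sketch below builds a toy DataFrame (the column names `rooms` and `town` are made up for this example) and flags its gaps with `isna()`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numeric and one text column, each with a gap.
toy = pd.DataFrame({
    "rooms": [6.5, np.nan, 5.9],
    "town": ["A", None, "C"],
})

# isna() flags both NaN and None as missing.
missing_mask = toy.isna()
print(missing_mask)

# NaN is not equal to itself, so compare with isna() rather than == np.nan.
print(np.nan == np.nan)  # False
```

Because `NaN != NaN`, an equality test never finds missing values; `isna()` (or its alias `isnull()`) is the reliable check.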

**Types of Missing Values**
1. Missing Completely At Random (MCAR)
* The probability of a value being missing is the same for all observations and is unrelated to any data, observed or unobserved.
* The statistical analysis remains unbiased.
* e.g. Someone forgot to type in the values.
2. Missing At Random (MAR)
* The probability of a value being missing depends on other observed data, but not on the missing value itself.
* The statistical analysis might be biased. An unbiased estimate of the parameters can be obtained only by modeling the missing data.
* e.g. In a survey, women are less likely than men to reveal their age, so whether age is missing depends on the observed gender.
3. Missing Not At Random (MNAR)
* The probability of a value being missing depends on the missing value itself (or on something unobserved), so the other observed data cannot explain the pattern.
* e.g. People with lower incomes may refuse to report their income in a survey or questionnaire.
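
The three mechanisms can be sketched with simulated data. Everything below is an assumption made for illustration: the `survey` frame, its `gender` and `income` columns, and the missingness probabilities are invented, not taken from the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: gender is always observed, income may go missing.
survey = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance depends only on an observed column (gender).
mar_prob = np.where(survey["gender"] == "F", 0.4, 0.1)
mar_mask = rng.random(n) < mar_prob

# MNAR: the chance depends on the value that goes missing (low income).
mnar_prob = np.where(survey["income"] < 40_000, 0.5, 0.05)
mnar_mask = rng.random(n) < mnar_prob

# Compare the mean of the values that remain with the true mean.
true_mean = survey["income"].mean()
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = survey.loc[~mask, "income"].mean()
    print(f"{name}: observed mean - true mean = {observed_mean - true_mean:.0f}")
```

In this simulation the MNAR case overestimates the mean, because low incomes are dropped more often. The MAR case shows little bias here only because income was generated independently of gender; if income differed by gender, the mean of the remaining rows would shift as well.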

**Why Do We Need to Care About Handling Missing Data?**
* Many machine learning algorithms fail if the dataset contains missing values. However, some algorithms, such as k-nearest neighbors and Naive Bayes, can work with missing values, depending on the implementation.
* You may end up building a biased machine learning model, leading to incorrect results if the missing values are not handled properly.
* Missing data can lead to a lack of precision in the statistical analysis.
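
One reason unhandled gaps cause trouble is that `NaN` propagates through ordinary arithmetic. The sketch below uses made-up numbers to show this, along with the NaN-aware alternatives.

```python
import numpy as np
import pandas as pd

values = np.array([6.0, np.nan, 5.0, 7.0])

# Plain NumPy reductions propagate NaN: one gap makes the result NaN.
plain_mean = np.mean(values)
print(plain_mean)  # nan

# NaN-aware variants skip the gap, but use 3 values instead of 4.
nan_aware_mean = np.nanmean(values)
print(nan_aware_mean)  # 6.0

# pandas skips NaN by default in its reductions.
pandas_mean = pd.Series(values).mean()
print(pandas_mean)  # 6.0
```

Skipping the gap gives a number, but it rests on fewer observations, which is where the loss of precision comes from.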

**What Are the Two Main Ways to Deal with Missing Values?**
1. Impute the missing data (fill in reasonable guesses)
* Replace each missing value with an arbitrary value, the mean, the mode, the median, or the previous/next value.
2. Remove the missing data
* Drop the affected rows and/or columns.
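
Both approaches can be sketched in pandas as follows. The `toy` frame and its values are invented for illustration; the calls used (`fillna`, `ffill`, `bfill`, `dropna`) are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one gap in each column.
toy = pd.DataFrame({
    "rooms": [6.0, np.nan, 5.0, 7.0, 5.0],
    "age": [40.0, 55.0, np.nan, 70.0, 55.0],
})

# 1. Imputation: fill gaps with a reasonable guess.
filled_mean = toy.fillna(toy.mean())          # column means
filled_median = toy.fillna(toy.median())      # column medians
filled_mode = toy.fillna(toy.mode().iloc[0])  # most frequent value per column
filled_prev = toy.ffill()                     # previous observed value
filled_next = toy.bfill()                     # next observed value
filled_const = toy.fillna(-1)                 # arbitrary constant

# 2. Removal: drop rows or columns that contain gaps.
rows_dropped = toy.dropna(axis=0)
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape, cols_dropped.shape)
```

Note that dropping columns removes every column with at least one gap, which here leaves no columns at all. Removal is usually only attractive when few rows or columns are affected.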

reference:
* https://www.analyticsvidhya.com/blog/2021/10/handling-missing-value/

**Other related information.**
* How to diagnose the missing data mechanism? (https://www.theanalysisfactor.com/missing-data-mechanism/)
* How to select rows with missing values? (https://stackoverflow.com/questions/30447083/python-pandas-return-only-those-rows-which-have-missing-values)

**Check if there are missing values.**
* In this case, only 5 values are missing, all in a single column (`RM`), out of 511 rows and 14 columns. For simplicity, I will fill the missing values with the column mean.

In [None]:
print(df.shape)

(511, 14)


In [None]:
# Optional: show only the rows that contain at least one missing value.
# null_data = df[df.isnull().any(axis=1)]
# print(null_data)
# Reference: https://stackoverflow.com/questions/30447083/python-pandas-return-only-those-rows-which-have-missing-values

# Count the missing values in each column.
df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         5
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [None]:
# Replace the missing values in a column with that column's mean.
# Assigning the result back avoids the chained in-place fill, which newer pandas versions warn about.
df['RM'] = df['RM'].fillna(df['RM'].mean())
# Reference: https://stackoverflow.com/questions/18689823/pandas-dataframe-replace-nan-values-with-average-of-columns
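
A quick way to confirm that the fill worked is to recount the missing values afterwards. The sketch below uses a small stand-in frame named `demo` with made-up values, not the notebook's actual `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df; the values are made up.
demo = pd.DataFrame({
    "RM": [6.0, np.nan, 5.0, 7.0],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

rm_mean = demo["RM"].mean()  # computed from the observed values only
demo["RM"] = demo["RM"].fillna(rm_mean)

# Confirm that no gaps remain in the column.
remaining = demo["RM"].isnull().sum()
print(remaining)            # 0
print(demo["RM"].tolist())  # [6.0, 6.0, 5.0, 7.0]
```

Mean imputation leaves the column mean unchanged but reduces its variance, which matters little with only a few gaps and more as the share of missing values grows.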

**Note**
* In this case, there are only 5 missing values, therefore I will replace