
![TSC](https://media.licdn.com/dms/image/C4D12AQGQ7CjWKCQPSw/article-cover_image-shrink_600_2000/0/1583331061002?e=2147483647&v=beta&t=7SEdQ9ZC2l0FmEGTEAtNoCd1B7RcvFZoVIy0MIGl23c)



<font size=8 color=#00FF11> Homework 2: Preparing for Machine Learning </font>

____

Let's step back and think about what we are learning this week in Chapter 2. We see that machine learning (ML) projects are quite involved! 

For today, let's break the process into five steps:
1. data science (which was done in ICA 2)
2. data preparation (<font color=#FF9900>which is what this HW covers</font>)
3. ML (next week)
4. metrics (next week)
5. deploy

Of course, not all ML projects exactly follow this plan. We'll adapt the spirit of this workflow as needed throughout the semester. 

____
<font color=#FFAA00> Read </font>

![pen](https://findicons.com/files/icons/766/base_software/128/pencil3.png)

Your first task is to completely read Chapter 2 of your textbook. Please send me any questions you have so that I can include them in the lecture next week. 


In this HW we will focus on the preparation steps. It would be useful for you to follow along starting on page 67 of your text. 

____
<font color=#FFAA00> Impute </font>

![pen](https://findicons.com/files/icons/766/base_software/128/pencil3.png)

Using the online documentation, research each of these methods and write a summary of what each does:
* `dropna`
* `fillna`
* `drop`

In your answer, include details: each of these methods has several options. 

Conclude your descriptions with your best guess on which of these might be preferred. Or, under what circumstances would you expect them to be the best choice? 

Apply each method to "data" in this code to help guide your answers:

In [6]:
import pandas as pd
import numpy as np

# Creating a simple dataset with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)
df

     A     B   C
0  1.0   5.0  10
1  2.0   NaN  20
2  NaN   NaN  30
3  4.0   8.0  40
4  5.0  10.0  50


### Answer
* Use `dropna` when you want to remove rows or columns with missing values.
* Use `fillna` when you want to fill missing values with a specific value or method.
* Use `drop` when you want to remove specific rows or columns.

The code cell below applies `dropna`, `fillna` and `drop` to the example data.
* `dropna`: Calling `df.dropna()` with its default parameters (`axis=0`, `how='any'`) drops every row that contains a missing value. Because column 'B' contains the most missing values, I passed `subset=['B']` so that only the rows with a missing 'B' are removed. With a dataset this small, dropping rows discards too much data, so `dropna` is not a good choice here. It is best suited to situations where the values are missing at random, so that removing them won't introduce bias into the dataset.

* `fillna`: I passed the median of column 'B' (computed beforehand; see the example code) as the `value` argument, so each missing entry is replaced with a number derived from the observed data. Note that a scalar `value` is applied to every column, which is why the missing entry in 'A' is also filled with 8.0. This keeps every row and so preserves most of the information in the data, but the filled-in values can introduce some bias. `fillna` is the best choice when retaining the rows or columns that contain missing values is crucial.

* `drop`: Passing the label 'B' with `axis=1` removes the whole 'B' column. The label names the row index or column to remove, and `axis=1` means the label refers to a column. Since column 'B' contains the most missing values, dropping it is the most natural use of `drop` here, although it does not remove every missing value (column 'A' still has one). `drop` is most appropriate when specific rows or columns have been identified as outliers, irrelevant, or unnecessary for the analysis.
    
    



In [9]:
df_new = df.copy()
# dropna: remove the rows where column B is NaN
df_dropped = df_new.dropna(subset=['B'])

# fillna: replace every NaN with the median of column B
# (a scalar value is applied to all columns, so the NaN in A is filled with it too)
median = df_new["B"].median()
df_filled = df_new.fillna(median)

# drop: remove column B (axis=1 means the label refers to a column)
df_dropped_row = df_new.drop('B', axis=1)

df_dropped,df_filled,df_dropped_row

(     A     B   C
 0  1.0   5.0  10
 3  4.0   8.0  40
 4  5.0  10.0  50,
      A     B   C
 0  1.0   5.0  10
 1  2.0   8.0  20
 2  8.0   8.0  30
 3  4.0   8.0  40
 4  5.0  10.0  50,
      A   C
 0  1.0  10
 1  2.0  20
 2  NaN  30
 3  4.0  40
 4  5.0  50)
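The `fillna` call above fills every column with the median of 'B'. As a minimal sketch of some other options these methods offer, the block below rebuilds the same example data and shows `how`/`thresh`/`subset` for `dropna` and a per-column fill for `fillna`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same example data as in the homework cell above.
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, 5],
    "B": [5, np.nan, np.nan, 8, 10],
    "C": [10, 20, 30, 40, 50],
})

# dropna options control which rows are removed.
rows_complete = df.dropna()               # default how='any': drop rows with any NaN
rows_min_two = df.dropna(thresh=2)        # keep rows with at least 2 non-NaN values
rows_b_present = df.dropna(subset=["B"])  # only consider NaN in column B

# Passing a Series to fillna fills each column with its own value,
# here the median of that column.
filled_per_column = df.fillna(df.median())

print(rows_min_two)
print(filled_per_column)
```

With a per-column fill, the missing entry in 'A' receives the median of 'A' rather than the median of 'B'.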

_____

As you can see, imputing data is tricky and we might need a more flexible tool. And, we'll need to have a tool that works with the complete ML workflow that includes breaking the data into training, validation and testing (review page 34). 

So far we have been using very simple tools from Pandas; let's upgrade to _scikit-learn_. 

____
<font color=#FFAA00> Sklearn's API </font>


![pen](https://findicons.com/files/icons/766/base_software/128/pencil3.png)

Read page 70 of your textbook. I will cover this in the lecture, but it is really useful to understand the design of _scikit-learn_ so that you can use it very efficiently. 

In a markdown cell:
* define "API"
* describe what a "transformer" is in the context of _scikit-learn_. 

#### Answer
 * **API**: An Application Programming Interface specifies how to call a library's functions: what parameters they accept and what results they return. It serves as a contract between the library (e.g. scikit-learn) and the developers who use it. A well-designed API simplifies the usage of the library, provides a clear interface, and promotes good software design practices.

* **Transformer**: A transformer is an estimator that transforms data. Calling `fit()` learns parameters from the training set, which gives a "trained" transformer; calling `transform()` then applies those learned parameters to a dataset. For example, a `SimpleImputer` with `strategy='median'` computes the median of each column in `fit()`, and `transform()` replaces the missing values with those medians.
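To make the fit/transform design concrete, here is a minimal sketch of a hand-written transformer. The class `ColumnMedianImputer` is hypothetical and exists only for illustration; it is a simplified stand-in and not how scikit-learn implements its own imputers.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnMedianImputer(TransformerMixin, BaseEstimator):
    """Hypothetical, simplified imputer used only to illustrate the API."""

    def fit(self, X, y=None):
        # Learn one statistic per column, ignoring the missing entries.
        self.medians_ = np.nanmedian(np.asarray(X, dtype=float), axis=0)
        return self  # returning self lets calls be chained

    def transform(self, X):
        # Apply the learned medians without modifying the input array.
        X = np.asarray(X, dtype=float).copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.medians_[cols]
        return X


train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 60.0]])
imputer = ColumnMedianImputer().fit(train)
filled = imputer.transform(train)
print(imputer.medians_)
print(filled)
```

The learned state lives in an attribute ending with an underscore (`medians_`), following the scikit-learn naming convention, and `TransformerMixin` supplies `fit_transform()` for free.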


____
<font color=#FFAA00> SimpleImputer </font>

Let's turn to `SimpleImputer` in scikit-learn. 

![pen](https://findicons.com/files/icons/766/base_software/128/pencil3.png)

Read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) carefully. In a markdown cell, summarize what "strategy" does, what the default is and how this compares to the other methods you summarized above. Conclude with your best assessment of how you would use this library. 

#### Answer

The `strategy` parameter defines which statistic (`mean`, `median`, `most_frequent` or `constant`) is used to replace the missing data. The default is `'mean'`.
* **mean and median** are typically used for numerical data.
* **most_frequent** is suitable for imputing categorical data.
* **constant** allows users to specify a constant value for imputation.

Compared with `drop`, `fillna` and `dropna`: both scikit-learn's `SimpleImputer` and the pandas methods handle missing values in a dataset. The difference is that `SimpleImputer` learns its fill values from the training data in `fit()` and can reuse them on validation, test, or new data, which makes it better suited to building predictive models. The pandas methods are better suited to general data cleaning and manipulation tasks outside of machine learning workflows.

`SimpleImputer` makes it fairly easy to handle missing data with different strategies. The choice of strategy depends on the characteristics of the dataset. It is particularly useful as a preprocessing step in machine learning pipelines.
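As a small sketch of how the four strategies behave, the block below imputes one numeric column and one categorical column. The toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with a single missing value.
numeric = np.array([[1.0], [2.0], [np.nan], [9.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(numeric)
median_filled = SimpleImputer(strategy="median").fit_transform(numeric)
constant_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(numeric)

# One categorical column; most_frequent also works on non-numeric data.
colors = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(colors)

print(mean_filled.ravel())      # missing entry replaced by the column mean
print(median_filled.ravel())    # missing entry replaced by the column median
print(constant_filled.ravel())  # missing entry replaced by fill_value
print(mode_filled.ravel())      # missing entry replaced by the most common label
```

The large value 9.0 pulls the mean above the median here, which is one reason to prefer `median` when a column is skewed.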

____
<font color=#FFAA00> Code Practice </font>

![pen](https://findicons.com/files/icons/766/base_software/128/pencil3.png)

Next, let's walk through the code given in the documentation and understand the API. What does this code do? Specifically, what does it return? What kind of variable is `imp_mean` and what are we supposed to do with it? Why is there no data?!

In [1]:
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

Let's make the fake data:

In [2]:
fake_data = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
fake_data

array([[ 7.,  2.,  3.],
       [ 4., nan,  6.],
       [10.,  5.,  9.]])

What does this do? 

#### Answer
    It makes a 3-by-3 NumPy array that contains one missing value (nan).

In [8]:
imp_mean.fit(fake_data)


Did anything happen? What do you see? What does `.fit()` do? 

#### Answer
<font color=#FFAA00>
No data was returned; the only output is the `SimpleImputer` object itself. Calling `.fit()` on `imp_mean` with `fake_data` as the argument trains the imputer on the existing data: it calculates the mean of each column and stores it for the next step.
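A quick way to check what `.fit()` learned is to inspect the fitted imputer's `statistics_` attribute. The sketch below repeats the setup from the cells above so that it runs on its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer

fake_data = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() returns the imputer itself, which is why the notebook shows
# the SimpleImputer object rather than an array.
fitted = imp_mean.fit(fake_data)
print(fitted is imp_mean)

# The learned column means are stored on the object.
print(imp_mean.statistics_)
```

The middle column's mean is computed from the two observed values only, which gives the 3.5 that appears in the transformed output.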

What does transform do? Check its output: does it give what you expected? What if you change the strategy? 

#### Answer
<font color=#FFAA00>
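It replaces each missing value with the mean of its column, as learned in `.fit()`. The output is what I expected: the missing entry becomes 3.5, the mean of 2 and 5. If we change the strategy to median, it replaces the missing data with the column median instead, and the same holds for the other strategies.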

In [9]:
imp_mean.transform(fake_data)

array([[ 7. ,  2. ,  3. ],
       [ 4. ,  3.5,  6. ],
       [10. ,  5. ,  9. ]])

If we combine these, what does `.fit_transform()` do? 

In [11]:
fake_data = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
imp_mean.fit_transform(fake_data)

array([[ 7. ,  2. ,  3. ],
       [ 4. ,  3.5,  6. ],
       [10. ,  5. ,  9. ]])

#### Answer
<font color=#FFAA00>
It fits the imputer and transforms the same data in a single call, so it directly returns the output of `.fit()` followed by `.transform()`.
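The equivalence can be checked directly. This is a minimal sketch using the same `fake_data`; the names `two_steps` and `one_step` are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

fake_data = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Fit and transform as two separate calls...
two_steps = SimpleImputer(strategy="mean").fit(fake_data).transform(fake_data)

# ...and as one combined call on a fresh imputer.
one_step = SimpleImputer(strategy="mean").fit_transform(fake_data)

print(np.array_equal(two_steps, one_step))
```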

What if we apply the learned imputer to new data? Explain what this gives; is this what you expected? 

#### Answer
<font color=#FFAA00>
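The missing values in the new data are replaced with the column means learned from the old data that the imputer was fitted on (7, 3.5 and 6), not with means computed from the new data. This is what I expected, because `.transform()` does not refit the imputer.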

In [12]:
imp_mean.transform(np.array([[7, 2, np.nan], [4, np.nan, 6], [np.nan, 6, 9]]))

array([[7. , 2. , 6. ],
       [4. , 3.5, 6. ],
       [7. , 6. , 9. ]])

Explain why we would want to use `.fit()` alone, and then `.transform()`, versus combining them in one step. 

#### Answer

* Use `.fit()` when you want to learn from the data without modifying it.
* Use `.transform()` when you want to apply a learned transformation to new data.
* Use `.fit_transform()` when you want to learn from and transform the training data in a single step.

<font color=#FFAA00>
`.fit_transform()` is convenient, but keeping the steps separate is beneficial when several datasets are involved, as in the case above: the imputer is fitted once and the same learned values are then applied to each dataset.
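In an ML workflow the typical reason to separate the steps is the train/test split: the imputer should learn only from the training set and then be reused on the test set. A minimal sketch with made-up arrays (`X_train` and `X_test` are illustrative names):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny made-up training and test sets, each with missing values.
X_train = np.array([[1.0, 100.0], [2.0, np.nan], [3.0, 300.0]])
X_test = np.array([[np.nan, 1000.0], [50.0, np.nan]])

imputer = SimpleImputer(strategy="mean")

# Learn the column means from the training data and fill it in one step.
X_train_filled = imputer.fit_transform(X_train)

# Reuse the training means on the test data; fit() is not called again,
# so the test values (50.0, 1000.0) have no influence on the fill values.
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)
print(X_test_filled)
```

Calling `fit_transform()` on the test set instead would compute new means from the test data, so the two sets would be filled inconsistently.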

____
<font color=#FFAA00> Encoding and Scaling </font>

In the housing dataset we have many numerical values and most of the methods we have explored so far work fine. For example, the mean of a column of numbers is well defined. 

How do we deal with non-numeric values? In fact, what are non-numeric values? 

![pen](https://findicons.com/files/icons/766/base_software/128/pencil3.png)

Answer all of these questions, each in its own markdown cell:


* what is an ordinal number?
    
    An ordinal number denotes order or rank. In data science, an ordinal variable is a categorical variable whose categories have a meaningful order, so two nearby values are more similar than two distant values. The textbook example uses ordered categories such as "bad", "average", "good", and "excellent": "good" and "excellent" are adjacent, so they are more similar to each other than "bad" and "excellent". In the housing data, however, `ocean_proximity` does not follow such an order.


* what is a cardinal number?

    Cardinal numbers are the numbers used for counting something. In "three apples", for example, the 3 is a cardinal number. The basic cardinal numbers are typically the positive integers.




* how do ordinal and cardinal numbers differ fundamentally from an integer?

    **Integers** are whole numbers that can be positive, negative, or zero. An integer by itself is just a value; cardinal and ordinal numbers are distinguished by what the number is used for.
    
    **Cardinal numbers** represent the quantity or size of a set and answer the question "how many?" or "how much?" 
    
    **Ordinal numbers** represent the order or position of an element in a sequence. Examples of ordinal numbers include 1st, 2nd, 3rd, 4th, and so on.


* what is a nominal number?

    A nominal number works like a tag: it identifies someone or something and does not denote an actual value or quantity. An example is the number on the back of a basketball player's jersey.

* in data science, what is "encoding"?

    In data science, "encoding" refers to the process of converting categorical data or text-based data into a numerical format that can be used for machine learning algorithms.

* does scikit-learn contain libraries for ordinal encoding? is it the same as one-hot encoding? explain

    Yes, scikit-learn provides ordinal encoding (`OrdinalEncoder`). It is not the same as one-hot encoding (`OneHotEncoder`); the two serve different kinds of categories.
    
    **Ordinal encoding** is used when there is a meaningful order among the categories and preserving this order is important, as in the "bad", "average", "good", and "excellent" case. Each category is mapped to a single integer.
    
    **One-hot encoding** is used for nominal data, where the order of the categories is meaningless. It represents each category as a binary vector with a single 1, for example:
    
    ```
    array([[0., 0., 0., 1., 0.],[1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.], ...,
    [0., 0., 0., 0., 1.], [1., 0., 0., 0., 0.], [0., 0., 0., 0., 1.]])
    ```



* when and why would we use encoding as part of the data preparation within the ML pipeline?

    Encoding is performed during the data preprocessing phase, after any missing categorical values have been imputed (for example with the `most_frequent` strategy). We use it as part of data preparation because most models can only learn from numbers, so categorical data must be converted into a numerical form. For ordinal categorical features, ordinal encoding also preserves the order of the categories. Proper encoding contributes to the accuracy, effectiveness, and interpretability of the resulting models.

* when would you use `MinMaxScaler` versus `StandardScaler`? what's the point of these libraries? 

    `MinMaxScaler` performs min-max scaling, also called normalization: it shifts and rescales each feature so that its values fall within a fixed range (0 to 1 by default). It is used when the features have very different ranges and you want a bounded range while preserving the relationships between values, because models tend to ignore the features with smaller ranges. For example, in the textbook case the total number of bedrooms spans a much larger range than `median_income`, so the model may ignore the `median_income` feature. The drawback is that min-max scaling is strongly affected by outliers.

    `StandardScaler` performs standardization and is the better choice when the data contain outliers, because it is much less affected by them. First it subtracts the mean value (so standardized values have a zero mean), then it divides the result by the standard deviation (so standardized values have a standard deviation equal to 1). Unlike min-max scaling, however, it does not restrict the values to a specific range.

    As mentioned above, models are likely to be affected by features with different ranges. These scalers put features of varying magnitudes on a comparable scale so that all features can be considered by the model.
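The encoding and scaling answers above can be sketched in a few lines. This is a minimal illustration on made-up data: the rating labels come from the textbook example quoted above, while the `ocean_proximity`-style labels and the numeric column with one outlier are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# Ordinal data: pass the category order explicitly so the codes respect it.
ratings = np.array([["good"], ["bad"], ["excellent"], ["average"]])
ordinal = OrdinalEncoder(categories=[["bad", "average", "good", "excellent"]])
rating_codes = ordinal.fit_transform(ratings)

# Nominal data: one binary column per category, no order implied.
places = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])
one_hot = OneHotEncoder()
place_matrix = one_hot.fit_transform(places).toarray()

# Scaling: one feature with an outlier (100.0).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
min_max = MinMaxScaler().fit_transform(values)     # squeezed into [0, 1]
standard = StandardScaler().fit_transform(values)  # zero mean, unit std

print(rating_codes.ravel())
print(one_hot.categories_)
print(place_matrix)
print(min_max.ravel())
print(standard.ravel())
```

In the min-max output the outlier maps to 1 and pushes the other four values close to 0, which illustrates the sensitivity to outliers described above.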

