# Question 1: Discretization

The categories:
- Cold     :        x <  70
- Warm     : 70  <= x <= 85
- Hot      : 85  <  x <  100
- Very Hot : 100 <= x

In [2]:
import pandas as pd

temperature = pd.DataFrame({ 'data' : [72, 85, 90, 77, 65, 80, 95, 102, 60, 68, 88, 73, 78, 69, 91, -10, -5, -20]})

temperature['label'] = pd.cut(temperature.data, bins=[-float('inf'),70,85,100,float('inf')], labels=["Cold", "Warm", "Hot", "Very Hot"])

print(temperature)

    data     label
0     72      Warm
1     85      Warm
2     90       Hot
3     77      Warm
4     65      Cold
5     80      Warm
6     95       Hot
7    102  Very Hot
8     60      Cold
9     68      Cold
10    88       Hot
11    73      Warm
12    78      Warm
13    69      Cold
14    91       Hot
15   -10      Cold
16    -5      Cold
17   -20      Cold


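The code categorizes every value in this dataset correctly. Because `pd.cut` closes each bin on the right by default, the bins are (-inf, 70], (70, 85], (85, 100], and (100, inf). These differ from the listed categories only at the edges: a value of exactly 70 would be labeled Cold instead of Warm, and a value of exactly 100 would be labeled Hot instead of Very Hot. The dataset contains neither value, so nothing had to be changed for the data given.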
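
If the edge values matter, the four category rules can be written out one by one. The following is only a sketch, not part of the original answer: the helper name `label_temperature` and the `"Unknown"` fallback for missing values are assumptions made for illustration.

```python
import numpy as np

def label_temperature(values):
    """Label temperatures with the exact boundaries from the category list."""
    x = np.asarray(values, dtype=float)
    conditions = [
        x < 70,                  # Cold
        (x >= 70) & (x <= 85),   # Warm
        (x > 85) & (x < 100),    # Hot
        x >= 100,                # Very Hot
    ]
    labels = ["Cold", "Warm", "Hot", "Very Hot"]
    # Values matching no rule (for example NaN) fall back to "Unknown"
    return np.select(conditions, labels, default="Unknown").tolist()

# Edge values that the right-closed pd.cut bins would label differently
print(label_temperature([70, 100]))
```

With this version, 70 comes out as Warm and 100 as Very Hot, which matches the category list at both edges.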

# Question 2: Numeric Coding of Nominal/Ordinal Attributes

## Task 2.A: One-Hot Encoding Using OneHotEncoder

In [1]:
import pandas as pd
from sklearn import preprocessing as pp

car_brands = pd.DataFrame({ 'data' : ["Toyota", "Ford", "Honda", "Toyota", "BMW", "Ford", "Honda"]})

hot = pp.OneHotEncoder(handle_unknown='ignore')

hot_brands = pd.DataFrame(hot.fit_transform(car_brands).toarray())

final_brands = car_brands.join(hot_brands)

print(final_brands)

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


     data    0    1    2    3
0  Toyota  0.0  0.0  0.0  1.0
1    Ford  0.0  1.0  0.0  0.0
2   Honda  0.0  0.0  1.0  0.0
3  Toyota  0.0  0.0  0.0  1.0
4     BMW  1.0  0.0  0.0  0.0
5    Ford  0.0  1.0  0.0  0.0
6   Honda  0.0  0.0  1.0  0.0


The encoder sorts the categories alphabetically, so the column indices are BMW = 0, Ford = 1, Honda = 2, Toyota = 3.  
Each row has a 1.0 in the column that matches its brand and 0.0 in the other three columns.
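
The column-to-brand mapping does not have to be read off by hand, because the fitted encoder stores it in its `categories_` attribute. A minimal sketch, assuming scikit-learn is installed; the variable names `brand_order` and `named` are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

car_brands = pd.DataFrame({"data": ["Toyota", "Ford", "Honda", "Toyota", "BMW", "Ford", "Honda"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(car_brands).toarray()

# categories_ holds one array per input column, in the order used for the output columns
brand_order = list(encoder.categories_[0])

# Use the brand names as column headers instead of 0..3
named = pd.DataFrame(encoded, columns=brand_order, index=car_brands.index)
print(brand_order)
print(car_brands.join(named))
```

Labeling the columns this way keeps the joined table readable without a separate legend.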

## Task 2.B Ordinal Encoding Using OrdinalEncoder

In [1]:
import pandas as pd
from sklearn import preprocessing as pp

brand_size = pd.DataFrame({
    'Brand': ["Toyota", "Ford", "Honda", "Toyota", "BMW", "Ford", "Honda"],
    'Size': ["M", "L", "S", "XL", "M", "S", "L"]
})

ordinal = pp.OrdinalEncoder()

enc_brand_size = pd.DataFrame(ordinal.fit_transform(brand_size))

final_brand_size = brand_size.join(enc_brand_size)

print(final_brand_size)


    Brand Size    0    1
0  Toyota    M  3.0  1.0
1    Ford    L  1.0  0.0
2   Honda    S  2.0  2.0
3  Toyota   XL  3.0  3.0
4     BMW    M  0.0  1.0
5    Ford    S  1.0  2.0
6   Honda    L  2.0  0.0


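Numeric values are assigned alphabetically within each column, as shown in the table. Note that this puts the sizes in the order L, M, S, XL rather than their natural order from small to extra large.
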
| Numeric Value | Brand | Size |
| ------------- | ----- | ---- |
| 0             | BMW   | L    |
| 1             | Ford  | M    |
| 2             | Honda | S    |
| 3             | Toyota| XL   |
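
If the codes for Size should follow the actual size order, `OrdinalEncoder` accepts an explicit `categories` list. This is a sketch under the assumption that the intended order is S < M < L < XL, which the task itself does not state; `size_order` and `SizeCode` are names chosen for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"Size": ["M", "L", "S", "XL", "M", "S", "L"]})

# Assumed ordering from smallest to largest
size_order = ["S", "M", "L", "XL"]

# One category list per encoded column; codes follow the position in that list
size_encoder = OrdinalEncoder(categories=[size_order])
sizes["SizeCode"] = size_encoder.fit_transform(sizes[["Size"]]).ravel()
print(sizes)
```

With the explicit list, S maps to 0, M to 1, L to 2, and XL to 3, so larger codes mean larger sizes.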

## Task 2.C Numeric Coding Using pandas' Factorize Function

The Brands variable can be reused in this problem.

In [1]:
import pandas as pd
import numpy as np

# Pass an array rather than a plain list; newer pandas versions warn about list input
label, uniques = pd.factorize(np.array(['Toyota', 'Ford', 'Honda', 'Toyota', 'BMW', 'Ford', 'Honda']))

print(label, uniques)

[0 1 2 0 3 1 2] ['Toyota' 'Ford' 'Honda' 'BMW']



`factorize` assigns codes in order of first appearance, so Toyota, which comes first, gets 0, Ford, which comes second, gets 1, and so on.  
This makes factorize simple to follow, because the numbers track the order of the data. The scikit-learn encoders instead sort the categories alphabetically and number them in that order. LabelEncoder works the same way, so Toyota gets a label of 3 because it is last alphabetically, not because it has the most letters.
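
The difference between the two orderings can be seen with pandas alone, since `pd.factorize` has a `sort` parameter. A small sketch; the variable names are illustrative only.

```python
import pandas as pd

brands = pd.Series(["Toyota", "Ford", "Honda", "Toyota", "BMW", "Ford", "Honda"])

# Default: codes follow the order of first appearance
codes_seen, uniques_seen = pd.factorize(brands)

# sort=True: the unique values are sorted first, so codes follow alphabetical order
codes_sorted, uniques_sorted = pd.factorize(brands, sort=True)

print(list(codes_seen), list(uniques_seen))
print(list(codes_sorted), list(uniques_sorted))
```

With `sort=True` Toyota gets code 3 and BMW gets code 0, the same numbering that the scikit-learn encoders produced above.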

## Task 2.D One-Hot Encoding Using pandas' get_dummies Function

In [1]:
import pandas as pd
car_brands = pd.DataFrame({ 'data' : ["Toyota", "Ford", "Honda", "Toyota", "BMW", "Ford", "Honda"]})

dummy = pd.get_dummies(car_brands)
print(dummy)

   data_BMW  data_Ford  data_Honda  data_Toyota
0     False      False       False         True
1     False       True       False        False
2     False      False        True        False
3     False      False       False         True
4      True      False       False        False
5     False       True       False        False
6     False      False        True        False



One-hot encoding stores one boolean column per category in a 2D table, so checking whether a row belongs to a category is a single lookup. It is the most expensive of these encodings in terms of space, since a dense table needs (number of rows) × (number of categories) cells.
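
The space cost can be checked on the same data. A rough sketch that counts cells rather than bytes; `total_cells` and `hot_cells` are names made up for the example.

```python
import pandas as pd

car_brands = pd.DataFrame({"data": ["Toyota", "Ford", "Honda", "Toyota", "BMW", "Ford", "Honda"]})
dummy = pd.get_dummies(car_brands)

n_rows, n_categories = dummy.shape
total_cells = n_rows * n_categories        # cells stored by the dense one-hot table
hot_cells = int(dummy.to_numpy().sum())    # cells that actually carry a "hot" value

print(total_cells, hot_cells)
```

Only one cell per row is ever set, so the share of useful cells shrinks as the number of categories grows.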

# Question 3: Data Preprocessing and Cleansing

In [44]:
import pandas as pd
import numpy as np

customers = pd.DataFrame({
    'CustomeID' : [101, 102, 103, 104, 105, 106],
    'Age' : [25, np.nan, 31,-22, 28, 35],
    'Income': [50000, 62000, np.nan, 45000, 78000, 88000],
    'Gender' : ['Male', 'Female', 'Male', np.nan, 'Female', 'F'],
    'JoinDate' : ['2022-01-15', '2022/01/22', '15-01-2022', '2022-01-22', np.nan, '2022-01-25']
})

customers['Age'] = customers['Age'].abs()
customers['Age'] = customers['Age'].fillna(customers['Age'].mean())

customers['Income'] = customers['Income'].fillna(customers['Income'].mean())

customers['JoinDate'] = pd.to_datetime(customers['JoinDate'], format='mixed')
customers['JoinDate'] = customers['JoinDate'].fillna(customers['JoinDate'].median())

customers['Gender'] = customers['Gender'].replace('F', 'Female')
customers['Gender'] = customers['Gender'].fillna(customers['Gender'].mode().iloc[0])

print(customers)

   CustomeID   Age   Income  Gender   JoinDate
0        101  25.0  50000.0    Male 2022-01-15
1        102  28.2  62000.0  Female 2022-01-22
2        103  31.0  64600.0    Male 2022-01-15
3        104  22.0  45000.0  Female 2022-01-22
4        105  28.0  78000.0  Female 2022-01-22
5        106  35.0  88000.0  Female 2022-01-25


For Age, I took the absolute value, assuming that -22 was a sign error, and filled the NaN with the mean (28.2), since it is a good representation of the average age in the group.  
For Income, I filled the NaN with the mean (64600) so that the imputed value stays within the range of the given data.  
For JoinDate, I converted every entry to YYYY-MM-DD and filled the NaN with the median, as that was the easiest value to insert; the mean would have needed more formatting.  
For Gender, I replaced F with Female first and then filled the NaN with the most common value to continue the trend.
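
The date cleanup relies on `format='mixed'` to guess the layout of each entry. An alternative sketch that tries a fixed list of formats is given here; the helper `parse_join_dates` and its format list are assumptions based on the three layouts that appear in this dataset.

```python
import pandas as pd

def parse_join_dates(raw, formats=("%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y")):
    """Parse each entry with the first explicit format that matches it."""
    raw = pd.Series(raw, dtype="object")
    parsed = None
    for fmt in formats:
        # errors="coerce" turns entries that do not match this format into NaT
        attempt = pd.to_datetime(raw, format=fmt, errors="coerce")
        # Keep earlier matches and fill the remaining gaps from this attempt
        parsed = attempt if parsed is None else parsed.fillna(attempt)
    return parsed

join_dates = parse_join_dates(
    ["2022-01-15", "2022/01/22", "15-01-2022", "2022-01-22", None, "2022-01-25"]
)
print(join_dates)
```

Listing the formats makes the day-first reading of `15-01-2022` explicit; the missing entry stays NaT and can then be filled with the median as before.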

# Question 4: Feature Selection

## Task 4.A Recursive Feature Elimination

In [1]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing


# Load the California housing data (downloaded on first use)
california_housing = fetch_california_housing()
housing_data = california_housing.data
housing_target = california_housing.target

# Recursively drop the weakest feature until 5 remain
estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=5, step=1)

selector = selector.fit(housing_data, housing_target)
X_selected = selector.transform(housing_data)

# Boolean mask of the kept features and each feature's rank (1 = kept)
selected_features = selector.support_
feature_ranking = selector.ranking_

ModuleNotFoundError: No module named 'sklearn'
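
Since the cell above did not run, the sketch below shows how `support_` and `ranking_` are usually read once RFE has been fitted. It assumes scikit-learn is installed and uses synthetic stand-in data from `make_regression` instead of the housing data, so the feature names are placeholders and the result says nothing about the California housing features.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: 8 features, 5 of which drive the target
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=5, noise=0.0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

selector = RFE(LinearRegression(), n_features_to_select=5, step=1).fit(X, y)

# support_ is a boolean mask of the kept features
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
# ranking_ is 1 for kept features; higher numbers were eliminated earlier
ranking = dict(zip(feature_names, selector.ranking_.tolist()))

print("Selected:", selected)
print("Ranking:", ranking)
```

For the housing data, the same two lines could be applied with `california_housing.feature_names` in place of the placeholder names.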

# Question 5: Data Transformation

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame( {'HouseID': [1,2,3,4,5,6,7], 'LotSize': [400,16000,25000,36000,49000,64000,81000] })
# Vectorized square root; the new column is appended after LotSize
df['LotSizeRoot'] = np.sqrt(df['LotSize'])

fig, ax1 = plt.subplots()

ax1.set_xlabel('House ID')

ax1.set_ylabel('Lot Size')
ax1.plot('HouseID', 'LotSize', data=df)
ax1.tick_params(axis = 'y', labelcolor = 'blue')

ax2 = ax1.twinx()
ax2.set_ylabel('Lot Size Root')
ax2.plot('HouseID', 'LotSizeRoot', data=df, color='orange')
ax2.tick_params(labelcolor = 'orange')

plt.show()


ModuleNotFoundError: No module named 'pandas'

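The square root transformation makes the data follow a straight line from the second house onward. The lot sizes of houses 2 through 7 are 1000 times the squares 16, 25, ..., 81, so their roots increase by the same amount (about 31.6) from one house to the next; house 1, with a lot size of 400, is the exception. The unrooted data follows a curve, so a more complicated formula would be needed to describe it, whereas the rooted data has a simple linear relationship between the house ID and the lot size.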
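
The linear trend can be checked numerically without the plot by looking at the step between consecutive houses. A small sketch using only NumPy; the variable names are illustrative.

```python
import numpy as np

lot_size = np.array([400, 16000, 25000, 36000, 49000, 64000, 81000])
lot_size_root = np.sqrt(lot_size)

# Step between consecutive houses, before and after the transformation
raw_steps = np.diff(lot_size)
root_steps = np.diff(lot_size_root)

print(raw_steps)
print(np.round(root_steps, 2))
```

The raw steps keep growing from house 2 onward, while the rooted steps stay constant, which is what a straight line looks like in a table of differences.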