<h2>Feature Engineering & Feature Selection</h2>

<h3>Importing Libraries</h3>

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline
warnings.filterwarnings("ignore")
from scipy.stats import pearsonr, spearmanr
from scipy.stats import f_oneway
from scipy.stats import shapiro 
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes




<h3>Reading Data</h3>

In [3]:
# authenticate
credential = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="###",
    resource_group_name="###",
    workspace_name="###",
)

In [4]:
version = "Outlier_Cleaned_Data"
# get a handle of the data asset and print the URI
data_asset = ml_client.data.get(name="Car-Data", version=version)
print(f"Data asset URI: {data_asset.path}")

# read the CSV file behind the data asset into a pandas DataFrame

df = pd.read_csv(data_asset.path)

Data asset URI: azureml://subscriptions/144c7089-5d3d-40fa-bfaf-6ffb69774b59/resourcegroups/AML-sdk-v2-RG1/workspaces/AML-sdk-v2-RG1-WS1/datastores/workspaceblobstore/paths/LocalUpload/1355fe348222a2929bf9227df96c0c27/Outlier_Cleaned_Data.csv


In [5]:
df.columns

Index(['Unnamed: 0', 'Fuel_Type', 'Power(kw)', 'Max_Torque(nm)', 'Cylinders',
       'Valves_Per_Cylinder', 'Engine_Capacity(cc)', 'Max_Power_Rpm',
       'Max_Torque_Rpm', 'Fuel_System', 'Turbo', 'Co2_Emissions(g/km)',
       'Compression_Ratio'],
      dtype='object')

In [6]:
df = df.drop("Unnamed: 0", axis = 1)

<h3>Feature Selection</h3>

In [7]:
Numerical_df = df.select_dtypes(include=['int', 'float'])
Categorical_df = df.select_dtypes(include=['object'])

<pre>Our dependent (target) feature is 'Co2_Emissions(g/km)', which is a numerical data type.</pre>

<pre>Numerical I/P vs Numerical O/P ('Co2_Emissions(g/km)')</pre>

In [8]:
Numerical_df.columns

Index(['Power(kw)', 'Max_Torque(nm)', 'Valves_Per_Cylinder',
       'Engine_Capacity(cc)', 'Max_Power_Rpm', 'Max_Torque_Rpm',
       'Co2_Emissions(g/km)', 'Compression_Ratio'],
      dtype='object')

<pre>
We will use the Pearson correlation coefficient to check whether each numerical feature 
has a linear relationship with the target feature.
We will use Spearman's rank correlation coefficient to check whether each numerical feature 
has a monotonic (possibly non-linear) relationship with the target feature.
</pre>
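
<pre>
A minimal sketch of the difference between the two coefficients, using synthetic data 
(not the car data set). The relationship below is strictly increasing but not linear, so 
Spearman's coefficient reaches 1 while Pearson's stays noticeably lower.
</pre>

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic, strictly increasing but non-linear relationship (illustrative only).
x = np.linspace(1, 10, 50)
y = np.exp(x)

# Both functions return a (coefficient, p-value) pair.
pearson_r, pearson_p = pearsonr(x, y)
spearman_rho, spearman_p = spearmanr(x, y)

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```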

In [9]:
def Num_vs_Num_Stat_test(columns):
    Pearson = {}
    Spearman = {}
    for column in columns:
        if column != "Co2_Emissions(g/km)":
            Pearson_test = pearsonr(df[column], df["Co2_Emissions(g/km)"])
            Spearman_test = spearmanr(df[column], df["Co2_Emissions(g/km)"])
            Pearson[column] = Pearson_test
            Spearman[column] = Spearman_test
    return Pearson, Spearman 

In [10]:
columns = Numerical_df.columns
Num_vs_Num_Stat_test(columns)

({'Power(kw)': (0.621586371559939, 3.627950253742235e-254),
  'Max_Torque(nm)': (0.5799720487666042, 9.756990709358874e-214),
  'Valves_Per_Cylinder': (-0.045869675125203624, 0.025328640410985988),
  'Engine_Capacity(cc)': (0.831538932919992, 0.0),
  'Max_Power_Rpm': (0.23223131605784547, 1.7911159574183147e-30),
  'Max_Torque_Rpm': (0.2243308678471081, 1.7000816098202115e-28),
  'Compression_Ratio': (-0.045142640714099486, 0.027745186962266757)},
 {'Power(kw)': SpearmanrResult(correlation=0.5987928129129545, pvalue=2.7543453333030585e-231),
  'Max_Torque(nm)': SpearmanrResult(correlation=0.5452618348665828, pvalue=3.2259926566812656e-184),
  'Valves_Per_Cylinder': SpearmanrResult(correlation=0.009162154688796657, pvalue=0.6552573742917072),
  'Engine_Capacity(cc)': SpearmanrResult(correlation=0.7999453371071605, pvalue=0.0),
  'Max_Power_Rpm': SpearmanrResult(correlation=0.19437809114571428, pvalue=1.1420443359462157e-21),
  'Max_Torque_Rpm': SpearmanrResult(correlation=0.204227603897

<pre>
Pearson Correlation Co-efficient ->
High degree: If the coefficient value lies between ± 0.50 and ± 1, then it is said to be a strong correlation. 
Moderate degree: If the value lies between ± 0.30 and ± 0.49, then it is said to be a medium correlation. 
Low degree: When the value lies below ± 0.29, then it is said to be a small correlation.

Spearman Rank Correlation -> 
1.0 (a perfect positive correlation) and -1.0 (a perfect negative correlation).
0 indicates no association between ranks.
</pre>
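
<pre>
The rule-of-thumb bands above can be sketched as a small helper. The function name 
correlation_strength is hypothetical, and the cut-offs at 0.50 and 0.30 are an assumption 
that closes the gap between the listed ranges. The coefficients are rounded copies of a 
few Pearson values from the output above.
</pre>

```python
def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to a rule-of-thumb band (hypothetical helper)."""
    magnitude = abs(r)  # the sign only gives the direction, not the strength
    if magnitude >= 0.50:
        return "high"
    if magnitude >= 0.30:
        return "moderate"
    return "low"

# Rounded Pearson coefficients copied from the output above.
pearson_coefficients = {
    "Power(kw)": 0.6216,
    "Engine_Capacity(cc)": 0.8315,
    "Max_Power_Rpm": 0.2322,
    "Valves_Per_Cylinder": -0.0459,
}

labels = {name: correlation_strength(r) for name, r in pearson_coefficients.items()}
print(labels)
```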

<pre>
Analysing the above results, the following features show almost no linear or monotonic relationship
with the target feature (their Pearson coefficients are close to zero, about -0.05).
1. Valves_Per_Cylinder
2. Compression_Ratio
</pre>

In [11]:
df = df.drop(columns=["Valves_Per_Cylinder", "Compression_Ratio"])

<pre>Categorical I/P vs Numerical O/P ('Co2_Emissions(g/km)')</pre>

In [12]:
Categorical_df

Unnamed: 0,Fuel_Type,Cylinders,Fuel_System,Turbo
0,gasoline,"4, in line",multipoint injection,"yes, with intercooler"
1,gasoline,"4, in line",multipoint injection,"yes, with intercooler"
2,gasoline,"4, in line",multipoint injection,"yes, with intercooler"
3,gasoline,"4, in line",direct injection,no
4,gasoline,"4, in line",multipoint injection,no
...,...,...,...,...
2372,gasoline,"4, in line",direct injection,"yes, with intercooler"
2373,gasoline,"4, in line",direct injection,"yes, with intercooler"
2374,gasoline,"4, in line",direct injection,"yes, with intercooler"
2375,gasoline,"4, in line",direct injection,"yes, with intercooler"


<pre>
To choose a statistical test, let's check how many unique categories each categorical feature has.
</pre>

In [13]:
def unique_categories_func(columns):
    Result = {}
    for column in columns:
        Uniques = df[column].unique()
        Result[column] = len(Uniques)
    return  Result

In [14]:
columns = Categorical_df.columns
unique_categories_func(columns)

{'Fuel_Type': 4, 'Cylinders': 19, 'Fuel_System': 6, 'Turbo': 5}

<pre>
Since every categorical feature has more than 2 unique values, we will use the one-way ANOVA test.
Before performing the test, we have to check the assumption of normality.
</pre>
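
<pre>
A rough sketch of that choice on synthetic data: a two-sample t-test when a feature has 
exactly 2 groups, and one-way ANOVA when it has more. The helper compare_group_means and 
the toy columns are hypothetical and are not part of this notebook's pipeline.
</pre>

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, ttest_ind

def compare_group_means(frame, feature, target):
    """Pick a mean-comparison test from the number of groups (hypothetical helper)."""
    groups = [g[target].to_numpy() for _, g in frame.groupby(feature)]
    if len(groups) < 2:
        raise ValueError("need at least two groups to compare")
    if len(groups) == 2:
        return "t-test", ttest_ind(*groups).pvalue
    return "anova", f_oneway(*groups).pvalue

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "turbo": ["yes", "no"] * 20,          # 2 categories
    "fuel": ["a", "b", "c", "d"] * 10,    # 4 categories
    "co2": rng.normal(150, 20, 40),
})

turbo_test, turbo_p = compare_group_means(toy, "turbo", "co2")
fuel_test, fuel_p = compare_group_means(toy, "fuel", "co2")
print(turbo_test, fuel_test)
```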

<pre>We will use the Shapiro-Wilk test to check the normality of the target feature.
If the p-value of the test is greater than α = .05, we fail to reject normality and treat the data as
normally distributed; if it is smaller, the data is not normally distributed.
</pre>
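
<pre>
The decision rule can be sketched as follows on synthetic samples. The helper looks_normal 
is hypothetical; it only wraps scipy.stats.shapiro and compares the p-value with α.
</pre>

```python
import numpy as np
from scipy.stats import shapiro

def looks_normal(sample, alpha=0.05):
    """Return True when Shapiro-Wilk fails to reject normality (hypothetical helper)."""
    statistic, p_value = shapiro(sample)
    return bool(p_value > alpha)

# Synthetic samples, for illustration only.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=150, scale=20, size=500)
skewed_sample = rng.exponential(scale=20, size=500)  # strongly right-skewed

print("normal sample treated as normal:", looks_normal(normal_sample))
print("skewed sample treated as normal:", looks_normal(skewed_sample))
```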

In [15]:
shapiro(df["Co2_Emissions(g/km)"])

ShapiroResult(statistic=0.895675778388977, pvalue=1.9060724034818683e-37)

<pre>
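The p-value (1.9e-37) is far below α = .05, so the test rejects normality: the target feature is not
normally distributed. The ANOVA results below should therefore be interpreted with caution.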
</pre>

In [16]:
def Anova_func(columns):
    for column in columns:
        All_Groups = []
        for group in df[column].unique():
            Group = df[df[column] == group]["Co2_Emissions(g/km)"]
            All_Groups.append(Group)
        print(column,":")
        print(f_oneway(*All_Groups))

In [17]:
columns = Categorical_df.columns
Anova_func(columns) 

Fuel_Type :
F_onewayResult(statistic=1.0433715907928685, pvalue=0.3722030255702256)
Cylinders :
F_onewayResult(statistic=226.5972394872426, pvalue=0.0)
Fuel_System :
F_onewayResult(statistic=18.12008365993093, pvalue=1.1066299620038037e-17)
Turbo :
F_onewayResult(statistic=26.591601701080272, pvalue=1.3209604920022089e-21)


<pre>
Based on the analysis, we fail to reject the null hypothesis for Fuel_Type, since its p-value > 0.05. This means 
the test found no significant difference in mean CO2 emissions between the fuel-type groups. 
</pre>
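
<pre>
ANOVA assumes roughly normal data within each group. As a cross-check that does not rely on 
that assumption, a Kruskal-Wallis H-test can be run on the same grouping. This is a sketch on 
synthetic data; the helper kruskal_by_feature and the toy values are hypothetical, and this 
test was not part of the original analysis.
</pre>

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

def kruskal_by_feature(frame, feature, target):
    """Kruskal-Wallis H-test of `target` across the groups of `feature` (hypothetical helper)."""
    groups = [g[target].to_numpy() for _, g in frame.groupby(feature)]
    return kruskal(*groups)

# Synthetic data with clearly separated group locations, for illustration only.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Fuel_System": np.repeat(["direct", "multipoint", "carburettor"], 30),
    "co2": np.concatenate([
        rng.normal(120, 10, 30),
        rng.normal(150, 10, 30),
        rng.normal(200, 10, 30),
    ]),
})

result = kruskal_by_feature(toy, "Fuel_System", "co2")
print("groups differ at the 5% level:", result.pvalue < 0.05)
```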

In [17]:
df = df.drop("Fuel_Type", axis = 1)

<h3>Save Data</h3>

In [18]:
df.to_csv("../Data/Feature_Selected_Data.csv")

<h3>Upload Data to Cloud Storage</h3>

In [19]:
# update the 'my_path' variable to match the location of where you downloaded the data on your
# local filesystem

my_path = "../Data/Feature_Selected_Data.csv"
# set the version number of the data asset
version = "Feature_Selected_Data"

my_data = Data(
    name="Car-Data",
    version=version,
    description="Co2 Emissions Prediction - Car Data",
    path=my_path,
    type=AssetTypes.URI_FILE,
)

## create data asset if it doesn't already exist:
ml_client.data.create_or_update(my_data)
print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")

Uploading Feature_Selected_Data.csv (< 1 MB): 100%|██████████| 226k/226k [00:00<00:00, 14.4MB/s]



Data asset created. Name: Car-Data, version: Feature_Selected_Data
