### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
response = requests.get("https://oracle.com")
assert response.status_code==200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>

In [1]:
import pandas as pd
import numpy as np

# 1 - Collecting the Data

In [3]:
!wget -O data.csv https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv

--2023-09-28 17:15:18--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1475504 (1.4M) [text/plain]
Saving to: ‘data.csv’


2023-09-28 17:15:18 (179 MB/s) - ‘data.csv’ saved [1475504/1475504]



In [2]:
df = pd.read_csv('data.csv')

# 2 - Data Preparation

In [3]:
df.columns = df.columns.str.replace(' ', '_').str.lower()

In [4]:
df.columns

Index(['make', 'model', 'year', 'engine_fuel_type', 'engine_hp',
       'engine_cylinders', 'transmission_type', 'driven_wheels',
       'number_of_doors', 'market_category', 'vehicle_size', 'vehicle_style',
       'highway_mpg', 'city_mpg', 'popularity', 'msrp'],
      dtype='object')

In [5]:
features = [
    'make', 'model', 'year', 'engine_hp',
       'engine_cylinders', 'transmission_type', 'vehicle_style',
       'highway_mpg', 'city_mpg', 'price'
]

In [6]:
#Renaming MSRP to price
df = df.rename(columns={df.columns[-1]: 'price'})
df.columns[-1]

'price'

In [7]:
#Filling the missing values of the features with 0
df[features] = df[features].fillna(0)

In [8]:
df[features].isnull().sum()

make                 0
model                0
year                 0
engine_hp            0
engine_cylinders     0
transmission_type    0
vehicle_style        0
highway_mpg          0
city_mpg             0
price                0
dtype: int64

## Question 1
- What is the most frequent observation (mode) for the column transmission_type?

In [11]:
df[features].transmission_type.value_counts()
# R = AUTOMATIC

AUTOMATIC           8266
MANUAL              2935
AUTOMATED_MANUAL     626
DIRECT_DRIVE          68
UNKNOWN               19
Name: transmission_type, dtype: int64

## Question 2
Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
- What are the two features that have the biggest correlation in this dataset?

In [12]:
df[features].dtypes

make                  object
model                 object
year                   int64
engine_hp            float64
engine_cylinders     float64
transmission_type     object
vehicle_style         object
highway_mpg            int64
city_mpg               int64
price                  int64
dtype: object

In [13]:
numerical = df[features].select_dtypes(include={'int64', 'float64'}).columns
numerical

Index(['year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg',
       'price'],
      dtype='object')

In [14]:
corr = df[numerical].corr()
corr
'''
engine_hp and year: 0.338714
engine_hp and engine_cylinders: 0.774851
highway_mpg and engine_cylinders: -0.614541
highway_mpg and city_mpg: 0.886829 <------
'''

'\nengine_hp and year: 0.338714\nengine_hp and engine_cylinders: 0.774851\nhighway_mpg and engine_cylinders: -0.614541\nhighway_mpg and city_mpg: 0.886829 <------\n'

In [15]:
# Making price binary
above_average = [df.price < df.price.mean()]
above_average = pd.DataFrame(above_average)
above_average = above_average.T
df['above_average'] = above_average

In [16]:
features = [
    'make', 'model', 'year', 'engine_hp',
       'engine_cylinders', 'transmission_type', 'vehicle_style',
       'highway_mpg', 'city_mpg', 'above_average'
]

In [17]:
#Spliting the data
from sklearn.model_selection import train_test_split

In [18]:
df_full_train, df_test = train_test_split(df[features], test_size=0.2, random_state=42)

In [19]:
df_train, df_val = train_test_split(df_full_train, test_size=0.2, random_state=42)

## Question 3
- Calculate the mutual information score between above_average and other categorical variables in our dataset. Use the training set only.
- Round the scores to 2 decimals using round(score, 2).

In [20]:
y_train = df_train.above_average.values
y_val = df_val.above_average.values
y_test = df_test.above_average.values

In [21]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [22]:
del df_train['above_average']
del df_val['above_average']
del df_test['above_average']

In [23]:
from sklearn.metrics import mutual_info_score

In [24]:
def mutual_info_above_average_score(series):
    return mutual_info_score(series, df_full_train.above_average)

In [25]:
#Which of these variables has the lowest mutual information score?
#R: transmission_type
mi = df_full_train[features].apply(mutual_info_above_average_score)
round(mi.sort_values(ascending=False),2)

above_average        0.59
model                0.46
engine_hp            0.36
make                 0.24
engine_cylinders     0.12
vehicle_style        0.08
year                 0.07
city_mpg             0.06
highway_mpg          0.04
transmission_type    0.02
dtype: float64

## Question 4
- Now let's train a logistic regression.
- Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
- Fit the model on the training dataset.
- To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
- model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.


In [26]:
features = [
    'make', 'model', 'year','engine_hp',
       'engine_cylinders', 'vehicle_style','transmission_type',
       'highway_mpg', 'city_mpg'
]

In [27]:
#model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
# One-hot encoding
from sklearn.feature_extraction import DictVectorizer

In [28]:
categorical = list(df[features].select_dtypes(include=['object']).columns)

In [29]:
numerical = list(df[features].select_dtypes(exclude=['object', 'bool']).columns)

In [30]:
print("cat:\n",categorical, "\nnum:\n",numerical)

cat:
 ['make', 'model', 'vehicle_style', 'transmission_type'] 
num:
 ['year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']


In [31]:
train_dicts = df_train[categorical+numerical].to_dict(orient='records')

In [32]:
train_dicts[0]

{'make': 'Volkswagen',
 'model': 'Beetle',
 'vehicle_style': '2dr Hatchback',
 'transmission_type': 'MANUAL',
 'year': 2016,
 'engine_hp': 210.0,
 'engine_cylinders': 4.0,
 'highway_mpg': 31,
 'city_mpg': 23}

In [33]:
dv = DictVectorizer(sparse=False)

In [34]:
X_train = dv.fit_transform(train_dicts)

In [35]:
X_train

array([[2.300e+01, 4.000e+00, 2.100e+02, ..., 0.000e+00, 0.000e+00,
        2.016e+03],
       [1.400e+01, 8.000e+00, 6.500e+02, ..., 0.000e+00, 0.000e+00,
        2.017e+03],
       [1.900e+01, 6.000e+00, 2.960e+02, ..., 0.000e+00, 0.000e+00,
        2.017e+03],
       ...,
       [1.700e+01, 6.000e+00, 2.600e+02, ..., 0.000e+00, 0.000e+00,
        2.012e+03],
       [1.900e+01, 4.000e+00, 1.360e+02, ..., 0.000e+00, 0.000e+00,
        1.993e+03],
       [1.700e+01, 6.000e+00, 3.650e+02, ..., 1.000e+00, 0.000e+00,
        2.015e+03]])

In [36]:
val_dicts = df_val[categorical+numerical].to_dict(orient='records')

In [37]:
X_val = dv.transform(val_dicts)

In [38]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression

In [39]:
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [40]:
y_pred = model.predict_proba(X_val)[: , 1]

In [41]:
y_pred

array([0.99874886, 0.00331597, 0.99965487, ..., 0.99338625, 0.99998666,
       0.99985274])

In [42]:
decision = (y_pred >= 0.5)

In [43]:
(y_val == decision).mean()

0.9381227058206607

In [44]:
df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = decision.astype(int)
df_pred['actual'] = y_val

In [45]:
df_pred['correct'] = df_pred.prediction == df_pred.actual

In [46]:
#R = 0.94 ~0.95
round(df_pred.correct.mean(),4)

0.9381

## Question 5
- Let's find the least useful feature using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

In [47]:
#Referência: 0.9381
#Sem year: 0.9470 - 0.9381 = 0.0089
#Sem engine_hp: 0.936 - 0.9381 = 0.0021
#Sem transmission_type: 0.9465 - 0.9381 = 0.0084
#Sem city_mpg: 0.9455 - 0.9381 = 0.0074
#R: year
features = [
    'make', 'model', 'year','engine_hp',
       'engine_cylinders', 'vehicle_style','transmission_type',
       'highway_mpg', 'city_mpg'
]
#Sequence
categorical = list(df[features].select_dtypes(include=['object']).columns)
numerical = list(df[features].select_dtypes(exclude=['object', 'bool']).columns)
train_dicts = df_train[categorical+numerical].to_dict(orient='records')
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)
val_dicts = df_val[categorical+numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_val)[: , 1]
decision = (y_pred >= 0.5)
(y_val == decision).mean()
df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = decision.astype(int)
df_pred['actual'] = y_val
df_pred['correct'] = df_pred.prediction == df_pred.actual
round(df_pred.correct.mean(),4)

0.9381

## Question 6
    For this question, we'll see how to use a linear regression model from Scikit-Learn.
    We'll need to use the original column price. Apply the logarithmic transformation to this column.
    Fit the Ridge regression model on the training data with a solver 'sag'. Set the seed to 42.
    This model also has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10].
    Round your RMSE scores to 3 decimal digits.

In [13]:
from sklearn.linear_model import Ridge
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

In [14]:
features = [
    'make', 'model', 'year', 'engine_hp',
       'engine_cylinders', 'transmission_type', 'vehicle_style',
       'highway_mpg', 'city_mpg', 'price'
]

In [15]:
#Applying the log transform to price
price_logs = np.log1p(df.price)
df['price'] = price_logs

In [16]:
df_full_train, df_test = train_test_split(df[features], test_size=0.2, random_state=42)

In [17]:
df_train, df_val = train_test_split(df_full_train, test_size=0.2, random_state=42)

In [18]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [19]:
#y
y_train = df_train.price.values
y_val = df_val.price.values
y_test = df_test.price.values

In [20]:
del df_train['price']
del df_val['price']
del df_test['price']

In [21]:
features.remove('price')

In [31]:
df_train[categorical+numerical].columns

Index(['make', 'model', 'transmission_type', 'vehicle_style', 'year',
       'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg'],
      dtype='object')

In [25]:
categorical = list(df[features].select_dtypes(include=['object']).columns)
numerical = list(df[features].select_dtypes(exclude=['object', 'bool']).columns)
train_dicts = df_train[categorical+numerical].to_dict(orient='records')
val_dicts = df_val[categorical+numerical].to_dict(orient='records')
test_dicts = df_test[categorical+numerical].to_dict(orient='records')

dv = DictVectorizer(sparse=False)

X_train = dv.fit_transform(train_dicts)
X_val = dv.transform(val_dicts)
X_test = dv.transform(test_dicts)

In [23]:
X_train

array([[2.300e+01, 4.000e+00, 2.100e+02, ..., 0.000e+00, 0.000e+00,
        2.016e+03],
       [1.400e+01, 8.000e+00, 6.500e+02, ..., 0.000e+00, 0.000e+00,
        2.017e+03],
       [1.900e+01, 6.000e+00, 2.960e+02, ..., 0.000e+00, 0.000e+00,
        2.017e+03],
       ...,
       [1.700e+01, 6.000e+00, 2.600e+02, ..., 0.000e+00, 0.000e+00,
        2.012e+03],
       [1.900e+01, 4.000e+00, 1.360e+02, ..., 0.000e+00, 0.000e+00,
        1.993e+03],
       [1.700e+01, 6.000e+00, 3.650e+02, ..., 1.000e+00, 0.000e+00,
        2.015e+03]])

In [35]:
def rmse(y, y_pred):
    se = (y - y_pred)**2
    mse=se.mean()
    return np.sqrt(mse)

In [52]:
clf = Ridge(alpha=0.1, solver='sag', random_state=42)
clf.fit(X_train, y_train)

In [53]:
#alfa = 0 ->    0.047 (numericals + categoricals)
#alfa = 0.01 -> 0.047 (numericals + categoricals)
#alfa = 0.1 ->  0.04743 122454984632 (numericals + categoricals)
#alfa = 1 ->    0.04743 241871967507 (numericals + categoricals)
#alfa = 10 ->   0.04745 529895020893 (numericals + categoricals)
#Declaring the object according to instructions


score = rmse(y_val,clf.predict(X_val))
score

0.04743122454984632