![banner](https://bccfilkom.net/static/assets/images/BCC-Logo.svg)

# Mini Bootcamp Data Science Day-2

### **TO BE A DATA SCIENTIST**

## Review Day 1

In [None]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv(
    "https://raw.githubusercontent.com/SulthanAbiyyu/mini-bootcamp-bcc-2023/master/Supermarket_Sales.csv",
)  
data.head()

In [None]:
data = data.drop_duplicates()

In [None]:
mean_total_col = data['Total'].mean()
# assign the result back instead of calling inplace=True on a column selection
data['Total'] = data['Total'].fillna(mean_total_col)

In [None]:
mode_gender_col = data["Gender"].mode()
# assign the result back instead of calling inplace=True on a column selection
data["Gender"] = data["Gender"].replace("Rather not say", mode_gender_col[0])

In [None]:
gender_outlier = ['Helicopter', 'Computer']
filter_gender = ~data['Gender'].isin(gender_outlier)

data = data[filter_gender]
data = data.dropna()

In [None]:
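def remove_outliers_iqr(data, column_name):
    # quartiles and interquartile range of the column
    Q1 = np.percentile(data[column_name], 25)
    Q3 = np.percentile(data[column_name], 75)
    IQR = Q3 - Q1

    # values more than 1.5 * IQR away from the quartiles count as outliers
    maximum = Q3 + (1.5 * IQR)
    minimum = Q1 - (1.5 * IQR)

    outlier_filter = (data[column_name] < minimum) | (
        data[column_name] > maximum)

    # keep only the rows that are not outliers
    data = data[~outlier_filter]

    return data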


data = remove_outliers_iqr(data=data, column_name="Unit price")

In [None]:
data = data.drop(["Invoice ID", "Date", "Time"], axis=1)

## Intro Machine Learning

### What is ML?
Machine learning is a method for solving problems without having to spell out the rules explicitly. A machine learning model learns and adapts from the data it is given.

![wiml](https://i.imgur.com/guZJACw.png)

### Tipe Machine Learning

By learning method

1. Supervised -> the model is shown examples of what the correct answer looks like
2. Unsupervised -> the model learns on its own, without an answer key
3. etc.

By task
1. Regression -> stock prices, rainfall, onion prices
2. Classification -> types, classes
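
To make the difference between the two tasks concrete, here is a minimal sketch of a supervised learner: a hypothetical 1-nearest-neighbour predictor that copies the label of the closest training example. The function name and all numbers are made up for illustration and do not come from the supermarket data.

```python
def nearest_label(train_x, train_y, query):
    """Supervised learning in miniature: return the label of the closest training example."""
    distances = [abs(x - query) for x in train_x]
    closest = distances.index(min(distances))
    return train_y[closest]


# made-up feature: unit price of a purchase
unit_price = [10.0, 20.0, 80.0, 90.0]

# regression -> the label is a number
rating = [9.1, 8.7, 5.2, 4.8]
predicted_rating = nearest_label(unit_price, rating, query=82.0)

# classification -> the label is a category
customer_type = ["Member", "Member", "Normal", "Normal"]
predicted_type = nearest_label(unit_price, customer_type, query=12.0)

print(predicted_rating, predicted_type)
```

The same learner handles both tasks. Only the kind of label changes: a number for regression, a class for classification.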

## Data Preprocessing

### Split
![](https://conlanscientific.com/media/content/splitting-data.png)

-> think of university: what we study is never exactly the same as what shows up on the exam, right?

In [None]:
from sklearn.model_selection import train_test_split

# split the data into a training set and a test set
train, test = train_test_split(data, test_size=0.2, random_state=2023)

print("Total rows: ", len(data))
print("Train rows: ", len(train))
print("Test rows: ", len(test))
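
As a rough picture of what a split does, the sketch below shuffles the rows with a fixed seed and cuts off the last 20% as the test set. `simple_train_test_split` is a hypothetical helper written for this illustration. It does not reproduce the exact shuffling or rounding of `train_test_split`.

```python
import random


def simple_train_test_split(rows, test_size=0.2, seed=2023):
    """Shuffle a copy of the rows, then cut off the last part as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(10))  # stand-in for 10 rows of data
train_rows, test_rows = simple_train_test_split(rows)

print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state`: rerunning the cell gives the same split, so results stay comparable.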

### What do we want to predict?

In [None]:
data.columns

In [None]:
label_regresi = "Rating"
label_klasifikasi = "Gender"

-> we separate the labels from the training data \
-> label == answer key

In [None]:
# drop the label columns from the training data
train_data = train.drop([label_regresi, label_klasifikasi], axis=1)

# keep the label columns on their own
train_label_regresi = train[label_regresi]
train_label_klasifikasi = train[label_klasifikasi]

In [None]:
test_data = test.drop([label_regresi, label_klasifikasi], axis=1)
test_label_regresi = test[label_regresi]
test_label_klasifikasi = test[label_klasifikasi]

### Scaling

Suppose someone makes this statement: \
"FILKOM guys refill their drinking water more often than FILKOM girls"

-> Is it because FILKOM guys really do get thirsty more often? \
-> Or is it because there are simply more guys at FILKOM, so we ran into them more often during the observation?
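
The sketch below works through that question with hypothetical counts (all numbers are made up). Comparing raw counts and comparing counts divided by group size can point in opposite directions, which is why features are put on a common scale before they are compared.

```python
# hypothetical observation counts, made up for illustration
refills = {"male": 80, "female": 30}
students = {"male": 400, "female": 100}

# raw counts suggest that male students refill more often
more_by_count = max(refills, key=refills.get)

# dividing by group size puts both groups on the same scale
refill_rate = {group: refills[group] / students[group] for group in refills}
more_by_rate = max(refill_rate, key=refill_rate.get)

print(more_by_count, more_by_rate, refill_rate)
```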

In [None]:
# take the columns with a numeric dtype
kolom_numerik = train_data.select_dtypes(include=np.number).columns.tolist()
kolom_numerik

#### Standard Scaling

- the mean becomes 0
- the standard deviation becomes 1

![](https://i.stack.imgur.com/QEPAU.png)

In [None]:
from sklearn.preprocessing import StandardScaler

# create the StandardScaler object
scaler = StandardScaler()
# copy the training data
train_ss = train_data.copy()
# fit the scaler on the training data and transform it
train_ss[kolom_numerik] = scaler.fit_transform(train_data[kolom_numerik])
train_ss.head()

In [None]:
# check the standard deviation and the mean
train_ss.describe()
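
For reference, standard scaling can be written by hand as `(x - mean) / std`. The sketch below uses made-up values and a hypothetical helper `standard_scale`. It uses the population standard deviation (ddof=0), which is what `StandardScaler` uses, and also prints the sample standard deviation (ddof=1), which is what `describe()` reports.

```python
import statistics


def standard_scale(values):
    """Return (x - mean) / std for every value, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]


unit_price = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up values
scaled = standard_scale(unit_price)

print(scaled)
print(statistics.fmean(scaled))   # approximately 0
print(statistics.pstdev(scaled))  # approximately 1 (population std, ddof=0)
print(statistics.stdev(scaled))   # slightly above 1 (sample std, ddof=1)
```

This is why the `std` row of `train_ss.describe()` can show values slightly above 1 even though the scaling worked as intended.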

#### Min Max Scaling

- the minimum and maximum are mapped to a chosen range
- usually between 0 and 1

![](https://michael-fuchs-python.netlify.app/post/2019-08-31-feature-scaling-with-scikit-learn_files/p18p7.png)

In [None]:
from sklearn.preprocessing import MinMaxScaler

# create the MinMaxScaler object
minmax = MinMaxScaler()
# copy the training data
train_mm = train_data.copy()
# fit the scaler on the training data and transform it
train_mm[kolom_numerik] = minmax.fit_transform(train_data[kolom_numerik])
train_mm.head()

In [None]:
# check the min and max
train_mm.describe()
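
Min-max scaling can also be written by hand as `(x - min) / (max - min)`, stretched to the target range. The sketch below uses made-up values and a hypothetical helper `min_max_scale`. It assumes the column is not constant, since a constant column would make the denominator zero.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Map the smallest value to new_min and the largest to new_max."""
    low, high = min(values), max(values)
    span = high - low  # assumed to be non-zero
    return [new_min + (x - low) / span * (new_max - new_min) for x in values]


quantity = [1, 3, 5, 10]  # made-up values
scaled_01 = min_max_scale(quantity)
scaled_pm1 = min_max_scale(quantity, new_min=-1.0, new_max=1.0)

print(scaled_01)
print(scaled_pm1)
```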

In [None]:
# say we go with the standard scaler
test_ss = test_data.copy()
# transform the test data with the scaler that was fitted on the training data
test_ss[kolom_numerik] = scaler.transform(test_data[kolom_numerik])
test_ss.head()
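
Note that the test data is only transformed and never fitted. The sketch below shows the idea with a hypothetical `SimpleStandardScaler` class and made-up numbers: the mean and standard deviation come from the training column and are then reused unchanged on the test column.

```python
import statistics


class SimpleStandardScaler:
    """Hypothetical stand-in that remembers the statistics it was fitted on."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        return [(x - self.mean_) / self.std_ for x in values]


train_values = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up training column
test_values = [35.0, 60.0]                      # made-up test column

# the statistics come from the training data only
simple_scaler = SimpleStandardScaler().fit(train_values)
# the test data is transformed with those same statistics
test_scaled = simple_scaler.transform(test_values)

print(test_scaled)
```

Fitting a second scaler on the test set would let information about the "exam" influence the preprocessing, and the two sets would no longer be on the same scale.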

### Feature Transformation
-> transforming the features

In [None]:
train_ss.head()

In [None]:
# take the columns with the object dtype
kolom_objek = train_ss.select_dtypes(include=object).columns.tolist()
kolom_objek

![](https://miro.medium.com/v2/resize:fit:1358/1*ggtP4a5YaRx6l09KQaYOnw.png)

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
# one-hot encode all object columns
train_ohe = ohe.fit_transform(train_ss[kolom_objek]).toarray()
# convert the result to a DataFrame
train_ohe = pd.DataFrame(train_ohe, columns=ohe.get_feature_names_out())
# join it with the training data
train_ss = pd.concat([train_ss.reset_index(drop=True), train_ohe], axis=1)
# drop the object columns because they are now one-hot encoded
train_ss = train_ss.drop(kolom_objek, axis=1)

train_ss.head()
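
One-hot encoding itself is simple enough to write by hand. The sketch below uses a made-up column and two hypothetical helpers: `fit_categories` learns the distinct categories, and `one_hot` turns every value into a row with a single 1.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from the training column."""
    return sorted(set(values))


def one_hot(values, categories):
    """Turn each value into a row with a 1 in its category's position."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


train_payment = ["Cash", "Ewallet", "Credit card", "Cash"]  # made-up column
categories = fit_categories(train_payment)
encoded = one_hot(train_payment, categories)

print(categories)
for row in encoded:
    print(row)
```

As with scaling, the categories are learned from the training data and reused for the test data, so both end up with the same columns in the same order.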

![](https://miro.medium.com/v2/resize:fit:1400/0*Xhaw5NqAkkqRPxUF.png)

In [None]:
test_ohe = ohe.transform(test_ss[kolom_objek]).toarray()
test_ohe = pd.DataFrame(test_ohe, columns=ohe.get_feature_names_out())
test_ss = test_ss.drop(kolom_objek, axis=1)
test_ss = pd.concat([test_ss.reset_index(drop=True), test_ohe], axis=1)
test_ss.head()

In [None]:
# check the training data info: no object columns are left
train_ss.info()

## Regression

### How It Works
![](https://miro.medium.com/v2/resize:fit:800/1*nhGPRU12caIw7NK5Rr3p-w.gif) \
source: https://medium.com/swlh/from-animation-to-intuition-linear-regression-and-logistic-regression-f641a31e1caf

### Training

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

# create the model
model_regresi = LinearRegression()
# model_regresi = Ridge()

# train the model
model_regresi.fit(train_ss, train_label_regresi)
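
To see what "fitting a line" means, the sketch below solves ordinary least squares for a single feature by hand. The helper `fit_line` and the data points are made up for illustration. `LinearRegression` handles many features at once, so this is only the one-feature special case.

```python
import statistics


def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# made-up points that lie close to y = 2x + 1
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(x, y)
print(slope, intercept)
print(slope * 5.0 + intercept)  # prediction for a new x
```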

### Evaluation

In [None]:
# predict
prediksi = model_regresi.predict(test_ss)
prediksi[:5]

In [None]:
# look at the answer key
test_label_regresi[:5]

In [None]:
from sklearn.metrics import mean_squared_error

# score
mse = mean_squared_error(test_label_regresi, prediksi)
rmse = np.sqrt(mse)

print("RMSE: ", rmse)
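
RMSE is the square root of the average squared gap between the answer key and the predictions. Here is a hand-written sketch with made-up ratings, where `rmse_manual` is a hypothetical helper:

```python
import math


def rmse_manual(actual, predicted):
    """Square root of the mean squared difference between actual and predicted."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual_rating = [7.0, 5.0, 9.0, 6.0]     # made-up answer key
predicted_rating = [6.0, 5.0, 7.0, 8.0]  # made-up predictions

rmse_value = rmse_manual(actual_rating, predicted_rating)
print(rmse_value)
```

Because of the square root, RMSE is in the same unit as the label. For `Rating` it can be read as roughly how many rating points a typical prediction is off by.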

## Classification

### How It Works
![](https://miro.medium.com/v2/resize:fit:640/1*6ApG38C_7iiuIPP9bopdhQ.gif) \
source: https://m-abdin.medium.com/an-intuitive-overview-of-a-perceptron-with-python-implementation-part-2-animating-the-learning-85cef0152ac3

In [None]:
train_label_klasifikasi.head()

In [None]:
gender_ke_angka = {
    "Male": 0,
    "Female": 1
}

# convert to numbers
train_label_klasifikasi = train_label_klasifikasi.replace(gender_ke_angka)
train_label_klasifikasi.head()

In [None]:
test_label_klasifikasi = test_label_klasifikasi.replace(gender_ke_angka)
test_label_klasifikasi.head()

### Training

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# create the model
model_klasifikasi = LogisticRegression()
# model_klasifikasi = RandomForestClassifier()

# train the model
model_klasifikasi.fit(train_ss, train_label_klasifikasi)
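
As a rough idea of what happens during training, the sketch below fits a one-feature logistic regression with plain gradient descent on made-up data. `sigmoid` and `fit_logistic` are hypothetical helpers. `LogisticRegression` uses a different solver and adds regularization by default, so its numbers would differ from this sketch.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, learning_rate=0.1, epochs=2000):
    """Plain gradient descent on the log loss for one feature."""
    weight, bias = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        # prediction error for every example with the current weight and bias
        errors = [sigmoid(weight * xi + bias) - yi for xi, yi in zip(x, y)]
        weight -= learning_rate * sum(e * xi for e, xi in zip(errors, x)) / n
        bias -= learning_rate * sum(errors) / n
    return weight, bias


# made-up, already scaled feature with labels 0 and 1
x = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 1, 1]

weight, bias = fit_logistic(x, y)
predictions = [1 if sigmoid(weight * xi + bias) >= 0.5 else 0 for xi in x]
print(predictions)
```

The model outputs a probability between 0 and 1, and the class is decided by comparing that probability with 0.5.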

### Evaluation

In [None]:
# predict
prediksi = model_klasifikasi.predict(test_ss)
prediksi[:5]

In [None]:
# look at the answer key
test_label_klasifikasi[:5]

In [None]:
from sklearn.metrics import accuracy_score

# score
akurasi = accuracy_score(test_label_klasifikasi, prediksi)
print("Accuracy: ", akurasi)
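
Accuracy is the fraction of predictions that match the answer key. The sketch below computes it by hand on made-up labels and compares it with a simple baseline that always predicts the most common class. `accuracy_manual` is a hypothetical helper, and for brevity the baseline class is taken from the answer key itself. In practice it would come from the training labels.

```python
from collections import Counter


def accuracy_manual(actual, predicted):
    """Fraction of predictions that match the answer key."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


actual_gender = [0, 1, 1, 0, 1, 1, 0, 1]     # made-up answer key
predicted_gender = [0, 1, 0, 0, 1, 1, 1, 1]  # made-up predictions

# baseline: always predict the most common class
majority_class = Counter(actual_gender).most_common(1)[0][0]
baseline = [majority_class] * len(actual_gender)

model_accuracy = accuracy_manual(actual_gender, predicted_gender)
baseline_accuracy = accuracy_manual(actual_gender, baseline)
print(model_accuracy, baseline_accuracy)
```

An accuracy score only means something relative to such a baseline. A model that barely beats "always guess the majority class" has learned very little from the features.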

# Thank You 🙏🔥🦅