# Regression Model Quiz

This quiz is part of the assessment process in your training with Algoritma. Congratulations on completing the *Regression Model* material! We will assess you with a quiz covering the material you have learned. The quiz is expected to be completed in class; please contact our instructor team if you missed the chance to take it there.

## Understanding the Method

Regression is a branch of *supervised machine learning* whose goal is to predict a numeric target variable. There are many methods for building a regression model, one of which is linear regression. Before building a linear regression, check the relationship between the target variable and the predictor variables.


In [1]:
import pandas as pd
import numpy as np

# suppress scientific notation
np.set_printoptions(suppress=True) # numpy output
pd.set_option('display.float_format', lambda x: '%.3f' % x) # pandas output

# suppress warning
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

1. Which of the following cases can be solved with a regression model?
  - [ ] predicting whether an employee will resign or not
  - [ ] detecting *fraudulent* transactions
  - [X] determining the price/economic value of a property
  - [ ] predicting customer sentiment toward a product

## Data Exploration

In this quiz, you will use the **criminology** dataset (`crime.csv`) to build a linear regression model. The `crime` data is used to build the model, and `crime_test` is used to evaluate it.

In [3]:
crime = pd.read_csv('crime.csv')
crime.sample(5)

Unnamed: 0,percent_m,is_south,mean_education,police_exp60,police_exp59,labour_participation,m_per1000f,state_pop,nonwhites_per1000,unemploy_m24,unemploy_m39,gdp,inequality,prob_prison,time_prison,crime_rate
15,142,1,88,81,77,497,956,33,321,116,47,427,247,0.052,26.099,946
45,126,0,104,106,97,599,989,40,24,78,25,593,171,0.047,16.7,508
31,125,0,109,90,81,586,964,97,82,105,43,617,163,0.043,30.901,754
9,140,0,118,71,68,632,1029,7,15,100,24,526,174,0.044,19.599,705
5,121,0,110,118,115,547,964,25,44,84,29,689,126,0.034,21.0,682


Variable descriptions:
- `percent_m`: percentage of males aged 14-24
- `is_south`: whether it is in a Southern state. 1 for Yes, 0 for No.  
- `mean_education`: mean years of schooling  
- `police_exp60`: police expenditure in 1960  
- `police_exp59`: police expenditure in 1959
- `labour_participation`: labour force participation rate  
- `m_per1000f`: number of males per 1000 females  
- `state_pop`: state population  
- `nonwhites_per1000`: number of non-whites resident per 1000 people  
- `unemploy_m24`: unemployment rate of urban males aged 14-24  
- `unemploy_m39`: unemployment rate of urban males aged 35-39  
- `gdp`: gross domestic product per head  
- `inequality`: income inequality  
- `prob_prison`: probability of imprisonment  
- `time_prison`: avg time served in prisons  
- `crime_rate`: crime rate in an unspecified category (a measure of the crime rate for each US state in 1960)

Carry out the exploratory data analysis needed before building the model. You can do some of the following:
- adjusting data types
- checking for missing values
- checking for outliers
- linearity test
- etc.
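The checks listed above can be sketched roughly as follows. This is a minimal illustration on a small made-up frame (`demo`, with hypothetical columns `x` and `y`), not on the `crime` data; the 1.5 × IQR rule for outliers and the Pearson correlation as a rough linearity indicator are assumptions about how one might do these checks, not the quiz's required method.

```python
import pandas as pd

# Made-up stand-in data (hypothetical values, NOT taken from crime.csv).
demo = pd.DataFrame({
    "is_south": [1, 0, 0, 1, 0, 1, 0, 0],
    "x": [10, 12, 15, 18, 20, 23, 25, 90],
    "y": [100, 118, 150, 181, 199, 232, 251, 260],
})

# 1. Data type adjustment: a 0/1 flag is better treated as categorical.
demo["is_south"] = demo["is_south"].astype("category")

# 2. Missing values: count of NaN per column.
missing = demo.isna().sum()

# 3. Outliers in "x" using the 1.5 * IQR rule.
q1, q3 = demo["x"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = demo[(demo["x"] < q1 - 1.5 * iqr) | (demo["x"] > q3 + 1.5 * iqr)]

# 4. Rough linearity indicator: Pearson correlation between predictor and target.
pearson_r = demo["x"].corr(demo["y"])

print(demo.dtypes)
print(missing)
print(outliers)
print(f"Pearson r between x and y: {pearson_r:.3f}")
```

A correlation coefficient only summarizes the strength of a linear association; a scatter plot of each predictor against the target is a useful complement.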

In [4]:
crime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   percent_m             47 non-null     int64  
 1   is_south              47 non-null     int64  
 2   mean_education        47 non-null     int64  
 3   police_exp60          47 non-null     int64  
 4   police_exp59          47 non-null     int64  
 5   labour_participation  47 non-null     int64  
 6   m_per1000f            47 non-null     int64  
 7   state_pop             47 non-null     int64  
 8   nonwhites_per1000     47 non-null     int64  
 9   unemploy_m24          47 non-null     int64  
 10  unemploy_m39          47 non-null     int64  
 11  gdp                   47 non-null     int64  
 12  inequality            47 non-null     int64  
 13  prob_prison           47 non-null     float64
 14  time_prison           47 non-null     float64
 15  crime_rate            47 

In [None]:
# your code here


Imagine you work as a government analyst and want to see how socio-economic conditions reflect the crime rate of a state. From the `crime` data, we want to predict the crime rate (`crime_rate`) of an area in the United States. One way to choose suitable predictors is to look at correlation values. Use `.corr()` to inspect the correlations between variables in the `crime` data.

In [9]:
# your code here
crime.corr()

Unnamed: 0,percent_m,is_south,mean_education,police_exp60,police_exp59,labour_participation,m_per1000f,state_pop,nonwhites_per1000,unemploy_m24,unemploy_m39,gdp,inequality,prob_prison,time_prison,crime_rate
percent_m,1.0,0.584,-0.53,-0.506,-0.513,-0.161,-0.029,-0.281,0.593,-0.224,-0.245,-0.67,0.639,0.361,0.115,-0.089
is_south,0.584,1.0,-0.703,-0.373,-0.376,-0.505,-0.315,-0.05,0.767,-0.172,0.072,-0.637,0.737,0.531,0.067,-0.091
mean_education,-0.53,-0.703,1.0,0.483,0.499,0.561,0.437,-0.017,-0.665,0.018,-0.216,0.736,-0.769,-0.39,-0.254,0.323
police_exp60,-0.506,-0.373,0.483,1.0,0.994,0.121,0.034,0.526,-0.214,-0.044,0.185,0.787,-0.631,-0.473,0.103,0.688
police_exp59,-0.513,-0.376,0.499,0.994,1.0,0.106,0.023,0.514,-0.219,-0.052,0.169,0.794,-0.648,-0.473,0.076,0.667
labour_participation,-0.161,-0.505,0.561,0.121,0.106,1.0,0.514,-0.124,-0.341,-0.229,-0.421,0.295,-0.27,-0.25,-0.124,0.189
m_per1000f,-0.029,-0.315,0.437,0.034,0.023,0.514,1.0,-0.411,-0.327,0.352,-0.019,0.18,-0.167,-0.051,-0.428,0.214
state_pop,-0.281,-0.05,-0.017,0.526,0.514,-0.124,-0.411,1.0,0.095,-0.038,0.27,0.308,-0.126,-0.347,0.464,0.337
nonwhites_per1000,0.593,0.767,-0.665,-0.214,-0.219,-0.341,-0.327,0.095,1.0,-0.156,0.081,-0.59,0.677,0.428,0.23,0.033
unemploy_m24,-0.224,-0.172,0.018,-0.044,-0.052,-0.229,0.352,-0.038,-0.156,1.0,0.746,0.045,-0.064,-0.007,-0.17,-0.05


2. Which of the variables below has the lowest correlation with `crime_rate` and is therefore expected to be unsuitable as a predictor?
  - [ ] crime_rate
  - [ ] police_exp60
  - [ ] unemploy_m39
  - [x] nonwhites_per1000

In [8]:
# your code here
crime.corr().abs().sort_values(by='crime_rate')

Unnamed: 0,percent_m,is_south,mean_education,police_exp60,police_exp59,labour_participation,m_per1000f,state_pop,nonwhites_per1000,unemploy_m24,unemploy_m39,gdp,inequality,prob_prison,time_prison,crime_rate
nonwhites_per1000,0.593,0.767,0.665,0.214,0.219,0.341,0.327,0.095,1.0,0.156,0.081,0.59,0.677,0.428,0.23,0.033
unemploy_m24,0.224,0.172,0.018,0.044,0.052,0.229,0.352,0.038,0.156,1.0,0.746,0.045,0.064,0.007,0.17,0.05
percent_m,1.0,0.584,0.53,0.506,0.513,0.161,0.029,0.281,0.593,0.224,0.245,0.67,0.639,0.361,0.115,0.089
is_south,0.584,1.0,0.703,0.373,0.376,0.505,0.315,0.05,0.767,0.172,0.072,0.637,0.737,0.531,0.067,0.091
time_prison,0.115,0.067,0.254,0.103,0.076,0.124,0.428,0.464,0.23,0.17,0.101,0.001,0.102,0.436,1.0,0.15
unemploy_m39,0.245,0.072,0.216,0.185,0.169,0.421,0.019,0.27,0.081,0.746,1.0,0.092,0.016,0.062,0.101,0.177
inequality,0.639,0.737,0.769,0.631,0.648,0.27,0.167,0.126,0.677,0.064,0.016,0.884,1.0,0.465,0.102,0.179
labour_participation,0.161,0.505,0.561,0.121,0.106,1.0,0.514,0.124,0.341,0.229,0.421,0.295,0.27,0.25,0.124,0.189
m_per1000f,0.029,0.315,0.437,0.034,0.023,0.514,1.0,0.411,0.327,0.352,0.019,0.18,0.167,0.051,0.428,0.214
mean_education,0.53,0.703,1.0,0.483,0.499,0.561,0.437,0.017,0.665,0.018,0.216,0.736,0.769,0.39,0.254,0.323


Build a simple linear regression to predict `crime_rate` from `police_exp60`, then store the model in a variable named `model`.

3. From the following statements, choose one or more that are least consistent with the model's **slope**.
- [ ] `police_exp60` has a significant effect on `crime_rate`
- [x] `police_exp60` does not have a significant effect on `crime_rate`
- [ ] Judging from its coefficient, `police_exp60` has a positive effect on `crime_rate`
- [ ] The relationship between the predictor and the target variable can be seen from the coefficient's per-unit value

In [11]:
import statsmodels.api as sm

# your code here
x = sm.add_constant(crime['police_exp60'])
y = crime['crime_rate']

# store the fitted model under the name the instructions ask for
model = sm.OLS(y, x).fit()
model.summary()

0,1,2,3
Dep. Variable:,crime_rate,R-squared:,0.473
Model:,OLS,Adj. R-squared:,0.461
Method:,Least Squares,F-statistic:,40.36
Date:,"Mon, 08 Jan 2024",Prob (F-statistic):,9.34e-08
Time:,16:20:06,Log-Likelihood:,-331.16
No. Observations:,47,AIC:,666.3
Df Residuals:,45,BIC:,670.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,144.4640,126.693,1.140,0.260,-110.708,399.636
police_exp60,8.9485,1.409,6.353,0.000,6.111,11.786

0,1,2,3
Omnibus:,0.527,Durbin-Watson:,2.122
Prob(Omnibus):,0.768,Jarque-Bera (JB):,0.568
Skew:,-0.233,Prob(JB):,0.753
Kurtosis:,2.729,Cond. No.,275.0


Build a new model named `model2` that uses all the variables in our data to predict `crime_rate`.

4. From the following statements, choose the one that best fits `model2`.
- [ ] Adj. R-squared is lower than R-squared, which means our model cannot yet capture the pattern of the target variable well.
- [ ] Judging from the model coefficients, `gdp` is the variable that most influences a high crime rate.
- [ ] Judging from the model coefficients, if an area has a low or even zero crime rate, the allocation of `police_exp60` is about 19.28
- [x] Judging from the model coefficients, when `inequality` increases by 1 unit, the crime rate increases by 7.0672, holding all other variables constant

In [12]:
# your code here
x_all = sm.add_constant(crime.drop(columns='crime_rate'))

model2 = sm.OLS(y, x_all).fit()
model2.summary()

0,1,2,3
Dep. Variable:,crime_rate,R-squared:,0.803
Model:,OLS,Adj. R-squared:,0.708
Method:,Least Squares,F-statistic:,8.429
Date:,"Mon, 08 Jan 2024",Prob (F-statistic):,3.54e-07
Time:,16:23:51,Log-Likelihood:,-308.01
No. Observations:,47,AIC:,648.0
Df Residuals:,31,BIC:,677.6
Df Model:,15,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-5984.2876,1628.318,-3.675,0.001,-9305.265,-2663.310
percent_m,8.7830,4.171,2.106,0.043,0.275,17.291
is_south,-3.8035,148.755,-0.026,0.980,-307.192,299.585
mean_education,18.8324,6.209,3.033,0.005,6.169,31.495
police_exp60,19.2804,10.611,1.817,0.079,-2.361,40.922
police_exp59,-10.9422,11.748,-0.931,0.359,-34.902,13.018
labour_participation,-0.6638,1.470,-0.452,0.655,-3.661,2.334
m_per1000f,1.7407,2.035,0.855,0.399,-2.411,5.892
state_pop,-0.7330,1.290,-0.568,0.574,-3.363,1.897

0,1,2,3
Omnibus:,2.036,Durbin-Watson:,1.723
Prob(Omnibus):,0.361,Jarque-Bera (JB):,1.135
Skew:,0.198,Prob(JB):,0.567
Kurtosis:,3.651,Cond. No.,101000.0
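Adjusted R-squared is always at most R-squared because it penalizes the number of predictors, so a gap between the two does not by itself signal a poor fit. As a small check, the standard formula can be applied to the rounded values printed in the two summaries; the helper name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Simple model: R-squared 0.473, 47 observations, 1 predictor.
print(round(adjusted_r2(0.473, 47, 1), 3))   # 0.461

# Model with all variables: R-squared 0.803, 47 observations, 15 predictors.
print(round(adjusted_r2(0.803, 47, 15), 3))  # 0.708
```

Both results agree with the `Adj. R-squared` values reported in the summaries.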


Use the `crime_test` data to evaluate the multiple linear regression model above (`model2`).

In [13]:
# your code here
crime_test = pd.read_csv('crime_test.csv')
crime_test.head()

Unnamed: 0,percent_m,is_south,mean_education,police_exp60,police_exp59,labour_participation,m_per1000f,state_pop,nonwhites_per1000,unemploy_m24,unemploy_m39,gdp,inequality,prob_prison,time_prison,crime_rate
0,123,0,102,97,87,526,948,113,76,124,50,572,158,0.021,37.401,653
1,177,1,87,58,56,638,974,24,349,76,28,382,254,0.045,31.7,831
2,152,1,87,57,53,530,986,30,72,92,43,405,264,0.069,22.701,798
3,124,0,105,121,116,580,966,101,106,77,35,657,170,0.016,41.6,1674
4,148,0,122,72,66,601,998,9,19,84,20,590,144,0.025,30.0,880


In [16]:
X_new = sm.add_constant(crime_test.drop(columns='crime_rate'))

crime_test['prediction'] = model2.predict(X_new)
crime_test.head()

Unnamed: 0,percent_m,is_south,mean_education,police_exp60,police_exp59,labour_participation,m_per1000f,state_pop,nonwhites_per1000,unemploy_m24,unemploy_m39,gdp,inequality,prob_prison,time_prison,crime_rate,prediction
0,123,0,102,97,87,526,948,113,76,124,50,572,158,0.021,37.401,653,737.789
1,177,1,87,58,56,638,974,24,349,76,28,382,254,0.045,31.7,831,971.151
2,152,1,87,57,53,530,986,30,72,92,43,405,264,0.069,22.701,798,903.354
3,124,0,105,121,116,580,966,101,106,77,35,657,170,0.016,41.6,1674,1161.329
4,148,0,122,72,66,601,998,9,19,84,20,590,144,0.025,30.0,880,823.742


5. RMSE, or Root Mean Squared Error, is the square root of the mean squared difference between the predicted crime rates and the actual data. What is the RMSE of the model?
- [x] 212.77
- [ ] 160.74
- [ ] 182.66
- [ ] 730.65

In [17]:
from statsmodels.tools.eval_measures import rmse
from statsmodels.tools.eval_measures import meanabs

# your code here
rmse(crime_test['prediction'], crime_test['crime_rate']) 

212.76547968229357
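To make the definition concrete, RMSE can be computed by hand with NumPy. The sketch below uses only the five `crime_test` rows displayed earlier (actual `crime_rate` and `prediction`), so its result differs from the value for the full test set; the helper name `rmse_manual` is hypothetical.

```python
import numpy as np

# Actual and predicted crime_rate for the five crime_test rows displayed earlier.
actual = np.array([653, 831, 798, 1674, 880], dtype=float)
predicted = np.array([737.789, 971.151, 903.354, 1161.329, 823.742])

def rmse_manual(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

rmse_first_five = rmse_manual(actual, predicted)
print(round(rmse_first_five, 2))
```

Because the errors are squared before averaging, a single large miss (such as the row with an actual value of 1674) dominates the result.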

6. One assumption of linear regression that must be met is that the errors are normally distributed. One statistical test that can be used to check the normality of the residuals is the Shapiro test. Run this test on the multiple linear regression model above. Which statement is consistent with the Shapiro test result?
- [ ] P-value > alpha, so our model does not pass the residual normality test.
- [x] P-value > alpha, so our model passes the residual normality test.
- [ ] P-value < alpha, so our model does not pass the residual normality test.
- [ ] P-value < alpha, so our model passes the residual normality test.

In [19]:
# your code here
from scipy.stats import shapiro
shapiro(model2.resid)

ShapiroResult(statistic=0.9846006631851196, pvalue=0.7848900556564331)
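The decision rule behind the Shapiro test can be sketched as below. The null hypothesis is that the residuals are normally distributed, so a p-value above alpha means normality is not rejected. The helper `passes_normality`, the synthetic `demo_resid` values, and the default `alpha = 0.05` are assumptions for illustration.

```python
import numpy as np
from scipy.stats import shapiro

def passes_normality(p_value: float, alpha: float = 0.05) -> bool:
    """True when the null hypothesis of normality is not rejected."""
    return bool(p_value > alpha)

# Synthetic residuals (NOT model2.resid), drawn from a normal distribution.
rng = np.random.default_rng(0)
demo_resid = rng.normal(loc=0, scale=1, size=47)

stat, p_value = shapiro(demo_resid)
print(f"W = {stat:.3f}, p-value = {p_value:.3f}, passes = {passes_normality(p_value)}")

# Applying the same rule to the p-value reported for model2 above.
print(passes_normality(0.7848900556564331))  # True, since 0.785 > 0.05
```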