<a href="https://colab.research.google.com/github/2876mohira/vizualizatsiya/blob/main/05_ml_05_amaliyot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Imgur](https://i.imgur.com/5pXzCIu.png)

# Data Science and Artificial Intelligence Practicum

## MODULE 5. Machine Learning

### Portfolio task: Predicting house prices in Tashkent

In this exercise, your task is to predict the prices of houses in Tashkent based on the given data.

In [42]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/anvarnarz/praktikum_datasets/main/housing_data_08-02-2021.csv')
df.head()

Unnamed: 0,location,district,rooms,size,level,max_levels,price
0,"город Ташкент, Юнусабадский район, Юнусабад 8-...",Юнусабадский,3,57,4,4,52000
1,"город Ташкент, Яккасарайский район, 1-й тупик ...",Яккасарайский,2,52,4,5,56000
2,"город Ташкент, Чиланзарский район, Чиланзар 2-...",Чиланзарский,2,42,4,4,37000
3,"город Ташкент, Чиланзарский район, Чиланзар 9-...",Чиланзарский,3,65,1,4,49500
4,"город Ташкент, Чиланзарский район, площадь Актепа",Чиланзарский,3,70,3,5,55000


# Column descriptions
- `location` - address of the house for sale
- `district` - district where the house is located
- `rooms` - number of rooms
- `size` - area of the house (sq. m)
- `level` - floor on which the house is located
- `max_levels` - total number of floors in the building
- `price` - price of the house

## Complete the task using the CRISP-DM methodology.
<img src="https://i.imgur.com/dzZnnYi.png" alt="CRISP-DM" width="800"/>

In [43]:
import pandas as pd
import numpy as np
import sklearn  # the scikit-learn library

In [44]:
# Specify the URL of the online dataset
URL = "https://github.com/ageron/handson-ml2/blob/master/datasets/housing/housing.csv?raw=true"
df = pd.read_csv(URL)

In [45]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

X_train = train_set.drop("median_house_value", axis=1)
y = train_set["median_house_value"].copy()

X_num = X_train.drop("ocean_proximity", axis=1)

In [46]:
from sklearn.base import BaseEstimator, TransformerMixin
# indices of the columns we need
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn: this class is only a transformer, not an estimator
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:  # the bedrooms_per_room column is optional
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
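
To see what these ratio features look like, here is a minimal sketch that applies the same arithmetic to two rows (indices 5627 and 9297 from the sample shown later in this notebook). It assumes the numeric columns are ordered as in the dataset: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income. It uses plain NumPy rather than the class itself so that it runs on its own.

```python
import numpy as np

# Column positions, matching the order of the numeric columns in the dataset
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

# Two rows of numeric features (values copied from sample rows 5627 and 9297)
X = np.array([
    [-118.27, 33.77, 39.0, 1731.0, 485.0, 2115.0, 478.0, 1.5369],
    [-122.57, 38.03, 24.0, 2330.0, 322.0, 911.0, 320.0, 6.5253],
])

rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
population_per_household = X[:, population_ix] / X[:, households_ix]
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]

# np.c_ appends the three ratios as new columns on the right
X_extended = np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
print(X_extended.shape)
print(X_extended[:, -3:].round(3))
```

The result has the eight original columns plus three ratio columns. Because the transformer addresses columns by position, it only works if the input array keeps this column order.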

In [47]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipeline = Pipeline([
          ('imputer', SimpleImputer(strategy='median')),
          ('attribs_adder', CombinedAttributesAdder(add_bedrooms_per_room = True)),
          ('std_scaler', StandardScaler())
])

In [48]:
from sklearn.compose import ColumnTransformer

num_attribs = list(X_num)
cat_attribs = ['ocean_proximity']

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs)
])
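
A minimal sketch of how a `ColumnTransformer` routes columns, using a tiny hand-made DataFrame. The column names `size`, `rooms`, and `district` and all values are hypothetical, chosen only for illustration. The numeric columns are imputed and scaled, the categorical column is one-hot encoded, and the results are concatenated side by side.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing numeric value
toy = pd.DataFrame({
    "size": [40.0, 55.0, np.nan, 70.0],
    "rooms": [1, 2, 2, 3],
    "district": ["A", "B", "A", "C"],
})

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

toy_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, ["size", "rooms"]),
    ("cat", OneHotEncoder(), ["district"]),
])

toy_prepared = toy_pipeline.fit_transform(toy)
# 2 scaled numeric columns + 3 one-hot columns (one per district)
print(toy_prepared.shape)
```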

In [49]:
X_prepared = full_pipeline.fit_transform(X_train)

In [50]:
X_prepared[0:5,:]

array([[ 1.27258656, -1.3728112 ,  0.34849025,  0.22256942,  0.21122752,
         0.76827628,  0.32290591, -0.326196  , -0.17491646,  0.05137609,
        -0.2117846 ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [ 0.70916212, -0.87669601,  1.61811813,  0.34029326,  0.59309419,
        -0.09890135,  0.6720272 , -0.03584338, -0.40283542, -0.11736222,
         0.34218528,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.44760309, -0.46014647, -1.95271028, -0.34259695, -0.49522582,
        -0.44981806, -0.43046109,  0.14470145,  0.08821601, -0.03227969,
        -0.66165785,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [ 1.23269811, -1.38217186,  0.58654547, -0.56148971, -0.40930582,
        -0.00743434, -0.38058662, -1.01786438, -0.60001532,  0.07750687,
         0.78303162,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.10855122,  0.5320839 ,  1

In [51]:
from sklearn.linear_model import LinearRegression

LR_model = LinearRegression()

In [52]:
LR_model.fit(X_prepared, y)

In [53]:
# pick 5 random rows
test_data = X_train.sample(5)
test_data

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
5627,-118.27,33.77,39.0,1731.0,485.0,2115.0,478.0,1.5369,NEAR OCEAN
9297,-122.57,38.03,24.0,2330.0,322.0,911.0,320.0,6.5253,NEAR BAY
19109,-122.63,38.23,45.0,2264.0,504.0,1076.0,472.0,3.0139,<1H OCEAN
1495,-122.01,37.94,26.0,1619.0,224.0,706.0,220.0,6.0704,NEAR BAY
13392,-117.57,34.07,4.0,2152.0,580.0,1083.0,441.0,3.1458,INLAND


In [54]:
# extract the prices that correspond to the rows above (these are the values we need to predict)
test_label = y.loc[test_data.index]
test_label

5627     141300.0
9297     387700.0
19109    194100.0
1495     268000.0
13392    118800.0
Name: median_house_value, dtype: float64

In [55]:
test_data_prepared = full_pipeline.transform(test_data)
test_data_prepared

array([[ 0.6543155 , -0.87669601,  0.8246007 , -0.4189335 , -0.12767915,
         0.60557054, -0.05771505, -1.23086316, -0.75980989,  0.11467264,
         1.16093584,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-1.489689  ,  1.11712539, -0.36567544, -0.14347812, -0.51670582,
        -0.45333602, -0.47246064,  1.38876977,  0.77326393, -0.02159939,
        -1.28739213,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ],
       [-1.51960534,  1.21073203,  1.30071116, -0.1738288 , -0.08233248,
        -0.30822009, -0.07346488, -0.45522411, -0.26750897, -0.07058839,
         0.1682403 ,  1.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [-1.21046981,  1.0750024 , -0.20697195, -0.47043768, -0.75059916,
        -0.63363157, -0.73495785,  1.14988134,  0.80587015,  0.00968439,
        -1.28464035,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ],
       [ 1.00333949, -0.73628606, -1

In [56]:
predicted_data = LR_model.predict(test_data_prepared)
predicted_data

array([129052.40744923, 320695.41500964, 228429.92373788, 285706.3022083 ,
       132267.15413422])

In [57]:
pd.DataFrame({'Prognoz':predicted_data, 'Real baxosi': test_label})

Unnamed: 0,Prognoz,Real baxosi
5627,129052.407449,141300.0
9297,320695.41501,387700.0
19109,228429.923738,194100.0
1495,285706.302208,268000.0
13392,132267.154134,118800.0


In [58]:
test_set

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
20046,-119.01,36.06,25.0,1505.0,,1392.0,359.0,1.6812,47700.0,INLAND
3024,-119.46,35.14,30.0,2943.0,,1565.0,584.0,2.5313,45800.0,INLAND
15663,-122.44,37.80,52.0,3830.0,,1310.0,963.0,3.4801,500001.0,NEAR BAY
20484,-118.72,34.28,17.0,3051.0,,1705.0,495.0,5.7376,218600.0,<1H OCEAN
9814,-121.93,36.62,34.0,2351.0,,1063.0,428.0,3.7250,278000.0,NEAR OCEAN
...,...,...,...,...,...,...,...,...,...,...
15362,-117.22,33.36,16.0,3165.0,482.0,1351.0,452.0,4.6050,263300.0,<1H OCEAN
16623,-120.83,35.36,28.0,4323.0,886.0,1650.0,705.0,2.7266,266800.0,NEAR OCEAN
18086,-122.05,37.31,25.0,4111.0,538.0,1585.0,568.0,9.2298,500001.0,<1H OCEAN
2144,-119.76,36.77,36.0,2507.0,466.0,1227.0,474.0,2.7850,72300.0,INLAND


In [59]:
X_test = test_set.drop('median_house_value', axis=1)
X_test

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
20046,-119.01,36.06,25.0,1505.0,,1392.0,359.0,1.6812,INLAND
3024,-119.46,35.14,30.0,2943.0,,1565.0,584.0,2.5313,INLAND
15663,-122.44,37.80,52.0,3830.0,,1310.0,963.0,3.4801,NEAR BAY
20484,-118.72,34.28,17.0,3051.0,,1705.0,495.0,5.7376,<1H OCEAN
9814,-121.93,36.62,34.0,2351.0,,1063.0,428.0,3.7250,NEAR OCEAN
...,...,...,...,...,...,...,...,...,...
15362,-117.22,33.36,16.0,3165.0,482.0,1351.0,452.0,4.6050,<1H OCEAN
16623,-120.83,35.36,28.0,4323.0,886.0,1650.0,705.0,2.7266,NEAR OCEAN
18086,-122.05,37.31,25.0,4111.0,538.0,1585.0,568.0,9.2298,<1H OCEAN
2144,-119.76,36.77,36.0,2507.0,466.0,1227.0,474.0,2.7850,INLAND


In [60]:
y_test = test_set['median_house_value'].copy()
y_test

20046     47700.0
3024      45800.0
15663    500001.0
20484    218600.0
9814     278000.0
           ...   
15362    263300.0
16623    266800.0
18086    500001.0
2144      72300.0
3665     151500.0
Name: median_house_value, Length: 4128, dtype: float64

In [61]:
X_test_prepared = full_pipeline.transform(X_test)

Prediction

In [62]:
y_predicted = LR_model.predict(X_test_prepared)

To compare the predictions with the real data, we use the root mean square error (RMSE) introduced in the previous section:


In [63]:
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(y_test, y_predicted)
# compute the RMSE
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)

72701.32600762138


So the RMSE is about $72,701. Not bad, but not good either: the model's price estimates are typically off by roughly $72,700.
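
For reference, RMSE is the square root of the mean squared difference between predictions and true values. The minimal sketch below uses the five predicted and actual prices from the comparison table above and checks the manual formula against `mean_squared_error`. With only five rows the result is purely illustrative and is not the test-set RMSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Values copied from the five-row comparison table earlier in the notebook
predicted = np.array([129052.407449, 320695.41501, 228429.923738,
                      285706.302208, 132267.154134])
actual = np.array([141300.0, 387700.0, 194100.0, 268000.0, 118800.0])

# RMSE by hand: sqrt(mean((prediction - actual)^2))
rmse_manual = np.sqrt(np.mean((predicted - actual) ** 2))

# Same quantity via scikit-learn's MSE followed by a square root
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a few large mistakes raise the RMSE more than many small ones.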

There is no single, universal way to improve a model's accuracy. Things you can try:

- Finding better parameters
- Choosing a better model (algorithm)
- Collecting more data, and so on.
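
As a sketch of the first option, `GridSearchCV` can try several hyperparameter combinations with cross-validation. The synthetic data from `make_regression` and the small grid below are assumptions made only to keep the example self-contained; they are not tuned for the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical, not the housing dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# A deliberately small grid so the search finishes quickly
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4]}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X_demo, y_demo)

print(grid_search.best_params_)
# best_score_ is a negative MSE, so negate it before taking the square root
print(np.sqrt(-grid_search.best_score_))
```
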
We will now try a different model.

DecisionTree


In [64]:
from sklearn.tree import DecisionTreeRegressor
Tree_model = DecisionTreeRegressor()
Tree_model.fit(X_prepared, y)

DecisionTreeRegressor()
Let's evaluate the model:

In [65]:
y_predicted = Tree_model.predict(X_test_prepared)

In [66]:
tree_mse = mean_squared_error(y_test, y_predicted)
# compute the RMSE
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

72209.83797973677


Not much different from before: about 72,210 versus 72,701 for linear regression.
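
One thing worth checking with an unconstrained decision tree is its error on the training data itself, since a tree with no depth limit can memorize the training set. The minimal sketch below compares training and test RMSE on synthetic data, which is an assumption used only so that the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical, not the housing dataset)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# No max_depth or min_samples_leaf: the tree can grow until every leaf is pure
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))

print("train RMSE:", train_rmse)
print("test RMSE:", test_rmse)
```

A training error far below the test error is a sign of overfitting.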

RandomForest

In [67]:
from sklearn.ensemble import RandomForestRegressor
RF_model = RandomForestRegressor()
RF_model.fit(X_prepared, y)

RandomForestRegressor()
Let's evaluate the model:


In [68]:
y_predicted = RF_model.predict(X_test_prepared)
rf_mse = mean_squared_error(y_test, y_predicted)
# compute the RMSE
rf_rmse = np.sqrt(rf_mse)
print(rf_rmse)

50260.330930746095


Better than before: the RMSE drops to about 50,260.

Evaluating with cross-validation
Above, we evaluated the model by splitting the data into train and test sets.
The drawback of this approach is that we always train on the same data and always test on the same data.

With cross-validation, we split the data into several parts and train and test the model multiple times, each time on different parts.

For example, with 5 folds the data is split into five parts; each part serves once as the test set while the remaining four are used for training.

For cross-validation there is no need to split the data into train and test sets by hand; sklearn does this itself.
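
A minimal sketch of the splitting step, using `KFold` on ten dummy samples (hypothetical data, identified only by their index). For a regressor and an integer `cv`, `cross_val_score` uses this kind of unshuffled K-fold split internally.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples, identified by their index
X_demo = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)
test_counts = np.zeros(len(X_demo), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kfold.split(X_demo), start=1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_counts[test_idx] += 1

# Every sample is used for testing exactly once across the five folds
print(test_counts)
```
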

In [69]:
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"].copy()

X_prepared = full_pipeline.transform(X)

We define a simple function to display the validation results

In [70]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Std.dev:", scores.std())

Cross-validation


In [30]:
from sklearn.model_selection import cross_val_score

LinearRegression

In [71]:
scores = cross_val_score(LR_model, X_prepared, y, scoring="neg_mean_squared_error", cv=10)
LR_rmse_scores = np.sqrt(-scores)

In [72]:
display_scores(LR_rmse_scores)

Scores: [84188.51219065 61197.24357613 86752.24346334 62289.14292385
 80540.40041898 68919.39949642 52503.82940087 90910.07884989
 77674.67507925 53941.60539478]
Mean: 71891.71307941683
Std.dev: 13249.525989444988


Decision Tree

In [73]:
scores = cross_val_score(Tree_model, X_prepared, y, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
display_scores(tree_rmse_scores)

Scores: [119097.89834814  71214.61299957  84153.06690169  76309.91107474
  90548.86275114  78431.87333729  67720.51187421  98357.89073599
  93901.00797601  74255.52406321]
Mean: 85399.1160061995
Std.dev: 14740.968077054213


Random Forest

In [74]:
scores = cross_val_score(RF_model, X_prepared, y, scoring="neg_mean_squared_error", cv=10)
RF_rmse_scores = np.sqrt(-scores)
display_scores(RF_rmse_scores)

Scores: [98260.36168834 47625.46853828 65617.08383644 56598.00682574
 61138.24843197 59979.33379031 46407.70545438 79493.0694942
 74410.27244613 49369.83207597]
Mean: 63889.9382581758
Std.dev: 15477.325302679006


Saving the model
To use the trained model in the future, we need to save it. In general, it is advisable to save not only the model but also any other objects that will be needed later, for example the pipeline.

For this we use Python's pickle or joblib modules.

In [75]:
import pickle

filename = 'RF_model.pkl' # any file name will do
with open(filename, 'wb') as file:
    pickle.dump(RF_model, file)

Load the model back:

In [76]:
with open(filename, 'rb') as file:
    model = pickle.load(file)

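Let's test the model. Note that `cross_val_score` refits a clone of the estimator on each fold, so this confirms that the loaded object is usable rather than testing its stored fitted state.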

In [77]:
scores = cross_val_score(model, X_prepared, y, scoring="neg_mean_squared_error", cv=5)
RF_rmse_scores = np.sqrt(-scores)
display_scores(RF_rmse_scores)


Scores: [76333.75119152 64095.39825925 61489.78644804 79601.20017315
 62315.3815087 ]
Mean: 68767.10351613132
Std.dev: 7629.425902319825


Saving with joblib
joblib is preferable for storing large NumPy arrays in compressed form.

If joblib is not installed, install it with pip install joblib!
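
A minimal sketch of the compression option: `joblib.dump` accepts a `compress` argument. The all-zeros array and the file names below are hypothetical, chosen so that the size difference is easy to see; real model files usually compress less.

```python
import os
import tempfile

import joblib
import numpy as np

# A large, highly compressible array (hypothetical stand-in for model data)
big_array = np.zeros((1000, 1000))

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "plain.jbl")
    compressed_path = os.path.join(tmp_dir, "compressed.jbl")

    joblib.dump(big_array, plain_path)                   # no compression
    joblib.dump(big_array, compressed_path, compress=3)  # compression level 3

    plain_size = os.path.getsize(plain_path)
    compressed_size = os.path.getsize(compressed_path)
    restored = joblib.load(compressed_path)

print(plain_size, compressed_size)
print(np.array_equal(big_array, restored))
```

Higher compression levels give smaller files at the cost of slower saving and loading.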

In [78]:
import joblib

filename = 'RF_model.jbl' # any file name will do
joblib.dump(RF_model, filename)

['RF_model.jbl']

Load the model

In [79]:
model = joblib.load(filename)

In [80]:
scores = cross_val_score(model, X_prepared, y, scoring="neg_mean_squared_error", cv=5)
RF_rmse_scores = np.sqrt(-scores)
display_scores(RF_rmse_scores)

Scores: [77402.93010389 64327.71334946 60997.36690539 80048.34442608
 62539.28199951]
Mean: 69063.12735686683
Std.dev: 8003.35554184075


Save the pipeline

In [81]:
filename = 'pipeline.jbl'
joblib.dump(full_pipeline, filename)

['pipeline.jbl']
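
An alternative to saving the preprocessing pipeline and the model as two files is to chain them into one `Pipeline` and save that single object, so that raw rows can be passed straight to `predict` after loading. The sketch below uses a tiny hypothetical DataFrame (columns `size`, `rooms`, `district`, `price`, with made-up values) and a hypothetical file name.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "size": [40.0, 55.0, 62.0, 70.0, 48.0, 81.0],
    "rooms": [1, 2, 2, 3, 2, 3],
    "district": ["A", "B", "A", "C", "B", "C"],
    "price": [30000, 42000, 45000, 60000, 38000, 66000],
})
X_toy = toy.drop("price", axis=1)
y_toy = toy["price"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["size", "rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["district"]),
])
# Preprocessing and model chained into a single estimator
model_pipeline = Pipeline([("prep", preprocess), ("model", LinearRegression())])
model_pipeline.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_pipeline.jbl")
    joblib.dump(model_pipeline, path)
    loaded = joblib.load(path)

# The loaded object accepts raw, unprepared rows
same = np.allclose(model_pipeline.predict(X_toy), loaded.predict(X_toy))
print(same)
```

Keeping both steps in one object avoids the risk of loading a model together with a mismatched preprocessing pipeline.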