Now it is your turn to design your first regression model. In this section you will use Kaggle's "House Prices" dataset, which consists of 79 variables describing many aspects of residential homes in Ames, Iowa. Your task is to use this data to predict house prices. The data and the variable descriptions are available here: House Prices

Inspect the data and carry out all the necessary data cleaning.
Explore the data and find some variables that you think will be useful for predicting house prices.
Build your first model with these features and estimate its parameters using OLS.

In [1]:
# import the modules used during EDA and modeling

import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
from scipy.stats.mstats import winsorize
from statsmodels.stats.weightstats import ttest_ind
from scipy.stats import norm,bernoulli, exponnorm
from sklearn.preprocessing import StandardScaler
from matplotlib.ticker import FuncFormatter
import warnings
%matplotlib inline 
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import linear_model
warnings.filterwarnings('ignore')

In [2]:
# first look at the data

Train = pd.read_csv("train.csv", encoding = "ISO-8859-1")
Train.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
# data types and non-null counts of the variables

Train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

In [4]:
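# absolute correlation of each numeric column with SalePrice (the last row of the matrix), strongest first
# select_dtypes keeps only the numeric columns, which newer pandas versions require before calling corr()
np.abs(Train.select_dtypes(include='number').corr().iloc[-1, 0:-1]).sort_values(ascending=False)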

OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
GarageYrBlt      0.486362
MasVnrArea       0.477493
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
KitchenAbvGr     0.135907
EnclosedPorch    0.128578
ScreenPorch      0.111447
PoolArea         0.092404
MSSubClass       0.084284
OverallCond      0.077856
MoSold           0.046432
3SsnPorch        0.044584
YrSold           0.028923
LowQualFinSF     0.025606
Id               0.021917
MiscVal          0.021190
BsmtHalfBath     0.016844
BsmtFinSF2       0.011378
Name: SalePrice, dtype: float64
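The ranking above can be reproduced on a small scale. The following is a minimal sketch on a hypothetical toy frame (the column names `Quality`, `Noise`, `Zone`, and `Price` are made up for illustration and are not part of the Kaggle data). It selects the target column by name instead of by position, which may be easier to read than `iloc[-1, 0:-1]`.

```python
import pandas as pd

# Hypothetical toy data: one strong predictor, one weak one, one text column.
toy = pd.DataFrame({
    'Quality': [1, 2, 3, 4, 5],
    'Noise': [3, 1, 4, 1, 5],
    'Zone': ['A', 'B', 'A', 'B', 'A'],
    'Price': [10, 20, 30, 40, 50],
})

# Keep numeric columns only, then rank them by absolute correlation with the target.
numeric = toy.select_dtypes(include='number')
ranking = numeric.corr()['Price'].drop('Price').abs().sort_values(ascending=False)
print(ranking)
```

Note that a ranking like this only captures linear, pairwise association with the target; it says nothing about how strongly the selected predictors are correlated with each other.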

In [5]:
# missing values (NaN) per column, as a percentage and as a count

def show_missing(df):
    """This function returns percentage and total number of missing values"""
    percent = df.isnull().sum()*100/df.shape[0]
    total = df.isnull().sum()
    missing = pd.concat([percent, total], axis=1, keys=['percent', 'total'])
    return missing[missing.total>0].sort_values('total', ascending=False)
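To see what `show_missing` returns, here is a small usage sketch. The function body is copied from the cell above; the toy frame and its column names (`a`, `b`, `c`) are hypothetical.

```python
import numpy as np
import pandas as pd

def show_missing(df):
    """Return the percentage and total number of missing values per column."""
    percent = df.isnull().sum() * 100 / df.shape[0]
    total = df.isnull().sum()
    missing = pd.concat([percent, total], axis=1, keys=['percent', 'total'])
    return missing[missing.total > 0].sort_values('total', ascending=False)

# Hypothetical toy frame: 'a' has two NaNs, 'b' has one, 'c' has none.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})

report = show_missing(toy)
print(report)
```

Columns without missing values (here `c`) are filtered out, and the remaining ones are sorted by their missing count.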

In [6]:
Missing=show_missing(Train)
Missing

,percent,total
PoolQC,99.520548,1453
MiscFeature,96.30137,1406
Alley,93.767123,1369
Fence,80.753425,1179
FireplaceQu,47.260274,690
LotFrontage,17.739726,259
GarageType,5.547945,81
GarageYrBlt,5.547945,81
GarageFinish,5.547945,81
GarageQual,5.547945,81


In [7]:
# drop the rows that have NaNs in columns with fewer than 10 missing values
# Because so few rows are affected, dropping them costs very little data.

Traindrop=Train.dropna(subset=list(Missing[Missing.total<10].index))

In [8]:
# drop the columns in which more than 20% of the values are missing
Traindrop=Traindrop.drop(list(Missing[Missing.percent>20].index), axis=1)
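The two cleaning steps above (drop rows for rarely-missing columns, drop columns that are mostly missing) can be sketched on a toy frame. Everything here is hypothetical: the column names are invented, and the row threshold is scaled down to `< 2` missing values because the frame has only five rows; the notebook itself uses `< 10` and `> 20`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80% missing
    'rarely_missing': [1.0, 2.0, np.nan, 4.0, 5.0],           # 1 value missing (20%)
    'complete': [10, 20, 30, 40, 50],
})

total = toy.isnull().sum()
percent = total * 100 / len(toy)

# Columns with only a few NaNs: drop the affected rows.
rare_cols = list(total[(total > 0) & (total < 2)].index)
# Columns with more than 20% NaNs: drop the whole column.
sparse_cols = list(percent[percent > 20].index)

cleaned = toy.dropna(subset=rare_cols).drop(columns=sparse_cols)
print(cleaned)
```

Assigning the result to a new name rather than using `inplace=True` keeps the original frame available for comparison.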

In [22]:
Last=show_missing(Traindrop)
Last

,percent,total
GarageType,5.582357,81
GarageFinish,5.582357,81
GarageQual,5.582357,81
GarageCond,5.582357,81
BsmtExposure,2.618884,38
BsmtFinType2,2.618884,38
BsmtQual,2.549966,37
BsmtCond,2.549966,37
BsmtFinType1,2.549966,37


In [25]:
# names of the non-numeric (object) columns
Numerik_Değil=Traindrop.dtypes[Traindrop.dtypes == "object"].index
Numerik_Değil

Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
       'PavedDrive', 'SaleType', 'SaleCondition'],
      dtype='object')

In [30]:
Traindrop.BsmtQual.mode()

0    TA
dtype: object

In [29]:
Traindrop.BsmtQual.value_counts()

TA    648
Gd    611
Ex    120
Fa     35
Name: BsmtQual, dtype: int64

In [31]:
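# fill categorical NaNs with the most frequent value; mode() returns a Series, so take its first element
# (passing the Series itself aligns on index and leaves the NaNs in place, which is what the check in the next cell shows)
for Col in Numerik_Değil:
    Traindrop[Col]=Traindrop[Col].fillna(Traindrop[Col].mode()[0])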
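One pandas detail is worth checking when filling with the mode: `Series.mode()` returns a Series (indexed from 0), not a scalar, and `fillna` aligns a Series argument on the index. The missing-value report that follows this loop lists the same columns and counts as the one before it, which is consistent with that behavior. The sketch below illustrates the difference on a hypothetical five-element series; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries (index 2 and 4).
s = pd.Series(['TA', 'Gd', np.nan, 'TA', np.nan], name='BsmtQual')

# mode() is a Series with index [0]; fillna aligns on index, so rows 2 and 4 stay NaN.
filled_with_series = s.fillna(s.mode())

# Taking the first element gives a scalar, which fills every NaN.
filled_with_scalar = s.fillna(s.mode()[0])

print(filled_with_series.isnull().sum())
print(filled_with_scalar.isnull().sum())
```

Whether the mode is the right replacement is a separate modeling question; for the garage and basement columns, a missing value may simply mean the house has no garage or basement.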

In [32]:
Last=show_missing(Traindrop)
Last

,percent,total
GarageType,5.582357,81
GarageFinish,5.582357,81
GarageQual,5.582357,81
GarageCond,5.582357,81
BsmtExposure,2.618884,38
BsmtFinType2,2.618884,38
BsmtQual,2.549966,37
BsmtCond,2.549966,37
BsmtFinType1,2.549966,37


In [23]:
# names of the numeric columns
Numerik=Traindrop.dtypes[Traindrop.dtypes != "object"].index
Numerik

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

In [21]:
# fill the remaining numeric NaNs with the column mean
for Col in Numerik:
    Traindrop[Col]=Traindrop[Col].fillna(Traindrop[Col].mean())

In [34]:
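# the 20 numeric variables with the highest absolute correlation with SalePrice
Corr_List=list(np.abs(Train.select_dtypes(include='number').corr().iloc[-1,0:-1]).sort_values(ascending=False).index[0:20])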

In [35]:
TrainFinal=Traindrop[Corr_List+['SalePrice']].dropna()

In [36]:
y = TrainFinal['SalePrice']
X = TrainFinal.drop('SalePrice', axis=1)

In [37]:
lrm = linear_model.LinearRegression()
lrm.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [38]:
print('Coefficients: \n', lrm.coef_)
print('Intercept (bias): \n', lrm.intercept_)

Coefficients: 
 [ 1.91242337e+04  1.83870274e+01  1.03763894e+04  6.75296329e+00
  1.05380444e+01  2.68919155e+01 -2.63906258e+03  1.72345133e+03
  1.57037858e+02  3.47639811e+02  6.16850981e+01  2.98419493e+01
  6.98335391e+03  1.64916611e+01  9.57978323e+00  2.66916559e+01
  2.06744118e+01  6.93430590e+00 -1.72858379e+03  4.92028443e-01]
Intercept (bias): 
 -1186150.4775977298
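The coefficients and intercept printed above are the ordinary least squares solution. As a minimal sketch of what that means, the example below solves the same kind of problem directly with `np.linalg.lstsq` on synthetic, noise-free data. The data, the true parameter values (3.0, 2.0, -1.5), and the variable names are assumptions made for illustration; this is not how scikit-learn is confirmed to compute its fit internally.

```python
import numpy as np

# Synthetic data (hypothetical): y depends linearly on two features, without noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]

# Prepend a column of ones so the first parameter plays the role of the intercept.
X_const = np.column_stack([np.ones(len(X)), X])

# Least squares solution: minimizes ||X_const @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

print('Coefficients:', coefs)
print('Intercept:', intercept)
```

Because the synthetic target has no noise, the recovered parameters match the ones used to generate it up to floating-point error.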


In [39]:
import statsmodels.api as sm

X = sm.add_constant(X)
results = sm.OLS(y, X).fit()
results.summary()

Dep. Variable:,SalePrice,R-squared:,0.795
Model:,OLS,Adj. R-squared:,0.792
Method:,Least Squares,F-statistic:,277.7
Date:,"Wed, 23 Oct 2019",Prob (F-statistic):,0.0
Time:,10:28:19,Log-Likelihood:,-17277.0
No. Observations:,1451,AIC:,34600.0
Df Residuals:,1430,BIC:,34710.0
Df Model:,20,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.186e+06,1.38e+05,-8.610,0.000,-1.46e+06,-9.16e+05
OverallQual,1.912e+04,1175.772,16.265,0.000,1.68e+04,2.14e+04
GrLivArea,18.3870,20.510,0.897,0.370,-21.845,58.619
GarageCars,1.038e+04,2969.048,3.495,0.000,4552.233,1.62e+04
GarageArea,6.7530,10.280,0.657,0.511,-13.413,26.919
TotalBsmtSF,10.5380,4.281,2.462,0.014,2.141,18.935
1stFlrSF,26.8919,21.011,1.280,0.201,-14.324,68.108
FullBath,-2639.0626,2857.550,-0.924,0.356,-8244.502,2966.376
TotRmsAbvGrd,1723.4513,1087.085,1.585,0.113,-409.000,3855.903

Omnibus:,682.127,Durbin-Watson:,1.974
Prob(Omnibus):,0.0,Jarque-Bera (JB):,117806.065
Skew:,-1.08,Prob(JB):,0.0
Kurtosis:,47.09,Cond. No.,2150000.0
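Several figures in the summary are tied together by simple formulas, so they can be cross-checked by hand. The sketch below recomputes the adjusted R-squared, AIC, and BIC from the other reported values, using the standard definitions (adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), AIC = 2p − 2·logL, BIC = p·ln(n) − 2·logL, with p counting the constant). The inputs are the rounded numbers shown in the table, so small deviations in the last digits are expected.

```python
import math

# Values read off the summary table (rounded as displayed there).
n_obs = 1451          # No. Observations
k_regressors = 20     # Df Model
log_lik = -17277.0    # Log-Likelihood
r_squared = 0.795     # R-squared

# Adjusted R-squared penalizes R-squared for the number of regressors.
adj_r2 = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k_regressors - 1)

# Information criteria; the parameter count includes the constant term.
n_params = k_regressors + 1
aic = 2 * n_params - 2 * log_lik
bic = n_params * math.log(n_obs) - 2 * log_lik

print(round(adj_r2, 3))
print(round(aic), round(bic))
```

The recomputed adjusted R-squared agrees with the reported 0.792, and the AIC and BIC land within rounding distance of the reported 34600 and 34710.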
