### Packages and Data

Important Functions:
https://scikit-learn.org/stable/modules/feature_selection.html

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics

In [2]:
data = pd.read_csv("train.csv")
macro = pd.read_csv("macro.csv")

### Feature Constraints

1. Keep as many rows as possible; don't drop rows unless you have a good reason
2. No more than 50 features
    + At most 10 of those come from macro.csv
3. Don't dummify geographic variables
    + You can dummify product_type if you want
4. Can't include timestamp
    + Some macro.csv variables can stand in for it because they are recorded by day
5. You can combine columns from macro.csv and train.csv to make new features
6. Can use any packages you want
7. All variables will be numeric
8. Must be able to explain why you chose all your variables
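
The countable constraints above (feature limit, macro limit, numeric-only, no timestamp) can be checked mechanically before modelling. The following is a minimal sketch, not part of the original workflow; the helper name `check_feature_constraints` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def check_feature_constraints(X, macro_cols, max_features=50, max_macro=10):
    """Return a list of constraint violations (an empty list means all checks pass)."""
    problems = []
    # Constraint 2: no more than 50 features in total
    if X.shape[1] > max_features:
        problems.append(f"{X.shape[1]} features exceeds the limit of {max_features}")
    # Constraint 2 (sub-item): at most 10 features taken from macro.csv
    n_macro = len(set(X.columns) & set(macro_cols))
    if n_macro > max_macro:
        problems.append(f"{n_macro} macro features exceeds the limit of {max_macro}")
    # Constraint 7: every column must be numeric
    non_numeric = X.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    # Constraint 4: timestamp may not be used as a feature
    if "timestamp" in X.columns:
        problems.append("timestamp must not be used as a feature")
    return problems


# Hypothetical toy frame: product_type is still a string column here
toy = pd.DataFrame({
    "full_sq": [40.0, 55.0],
    "cpi": [1.0, 2.0],
    "product_type": ["Investment", "OwnerOccupier"],
})
violations = check_feature_constraints(toy, macro_cols=["cpi", "usdrub"])
print(violations)
```

Constraints that need judgement (explaining each variable, not dummifying geographic variables) are deliberately left out of this check.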

### Data Cleaning

In [3]:
# Convert timestamp in both data frames to datetime objects
macro['timestamp'] = pd.to_datetime(macro['timestamp'])
data['timestamp'] = pd.to_datetime(data['timestamp'])

#### Main Training Data

In [4]:
# remove id and timestamp, and remove NA for simplicity
data2 = data.drop(["id", "timestamp"], axis=1).dropna(axis=0)

# save this one categorical variable so I can append it later (it only has two levels and we're allowed to use it)
prod_type = data2["product_type"]

# extract and save response variable
y = data2['price_doc']

# keep only numeric columns; .copy() makes this an independent frame, so adding a column below does not trigger SettingWithCopyWarning
data_numeric = data2.select_dtypes(exclude=['object']).copy()
data_numeric['is_investment'] = np.where(prod_type == 'Investment', 1, 0)
X = data_numeric.drop(['price_doc'], axis=1) # remove response variable
data_numeric.dtypes.unique() # ensure that all variables are numeric

array([dtype('int64'), dtype('float64'), dtype('int32')], dtype=object)

#### Macro data

Only 9 columns in macro have no NAs, and one of them is timestamp.

In [5]:
# only keep numeric columns (this does not remove timestamp)
macro_numeric = macro.select_dtypes(exclude=['object'])

# list the 23 columns with the fewest missing values (per the original note, these are the columns with fewer than 90 NAs)
low_na_cols = list(
    pd.DataFrame(macro_numeric.isnull().sum(axis=0))
    .reset_index()
    .sort_values(by=0)
    .iloc[:23]['index']
)
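
The cell above picks columns by rank (the 23 with the fewest NAs). If the intent is a cutoff on the number of missing values, the threshold can be applied directly. This is a minimal sketch on a made-up frame; `toy_macro` and `max_missing` are hypothetical names, and the cutoff of 2 is chosen only to suit the toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for macro_numeric
toy_macro = pd.DataFrame({
    "timestamp": pd.date_range("2015-01-01", periods=5, freq="D"),
    "cpi": [1.0, 2.0, 3.0, 4.0, 5.0],
    "usdrub": [60.0, np.nan, 61.0, 62.0, 63.0],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Count missing values per column
na_counts = toy_macro.isnull().sum()

# Keep columns with fewer than max_missing NAs (hypothetical cutoff for the toy data)
max_missing = 2
keep_cols = na_counts[na_counts < max_missing].index.tolist()
print(keep_cols)
```

A threshold keeps the selection stable if columns are added or removed, whereas a fixed rank such as `iloc[:23]` depends on how many columns the frame has.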

In [6]:
# attach price_doc to each day's macro values, drop incomplete rows, then drop the join key
macro2 = (
    pd.merge(macro_numeric[low_na_cols], data[['timestamp', 'price_doc']],
             how='left', on='timestamp')
    .dropna(axis=0)
    .drop(['timestamp'], axis=1)
)
X_macro = macro2.drop('price_doc', axis=1)
y_macro = macro2['price_doc']

### Feature Selection from Main Training Data

#### Choosing Features by Eye

1.  Property attributes:
    + full_sq
    + floor
    + build_year
    + num_room
    + state (condition)

2. Neighborhood attributes:
    + full_all (total population; raion_popul looks similar and the difference between the two is unclear)
    + work_all (working age population)
    + young_all (lower than working age population)
    + green_zone_part (proportion of area that's greenery)
    + indust_part (proportion of area that's industrial)
    + culture_objects_top_25_raion (number of objects of cultural heritage)
    + shopping_center_raion
    + oil_chemistry_raion (presence of dirty industries)
    + radiation_raion (presence of radiation disposal)
    + big_market_raion (presence of large grocery / wholesale markets)
    + detention_facility_raion (presence of prison)
    + metro_min_walk (walking time to metro)
    + park_km (distance to park)
    + railroad_station_walk_km (distance to railroad station walking)
    + kremlin_km (distance to city center/Kremlin)
    

#### Using F-Statistics

This selector fits a separate univariate linear regression of the response on each predictor, computes the F-statistic for each fit, and keeps the 40 predictors with the largest F-statistics (equivalently, the smallest p-values).
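
To make the score concrete, the F-statistic for each predictor can be reproduced from its Pearson correlation r with the response: F = r² / (1 − r²) · (n − 2). The sketch below checks this against `f_regression` on synthetic data; the arrays `X_toy` and `y_toy` are made up for illustration and have nothing to do with train.csv.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data: feature 0 is strong, feature 1 is weak, feature 2 is pure noise
rng = np.random.default_rng(0)
n = 200
X_toy = rng.normal(size=(n, 3))
y_toy = 2.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(size=n)

# Scores as computed by scikit-learn (one F-statistic and p-value per feature)
F_scores, p_values = f_regression(X_toy, y_toy)

# Same scores by hand, from the correlation of each feature with the response
r = np.array([np.corrcoef(X_toy[:, j], y_toy)[0, 1] for j in range(X_toy.shape[1])])
F_manual = r**2 / (1 - r**2) * (n - 2)

print(F_scores)
print(F_manual)
```

Because each predictor is scored on its own, this method only sees linear, one-at-a-time relationships; it says nothing about redundancy among the selected columns.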

In [7]:
# Define and fit feature selector
F_selector = SelectKBest(f_regression, k=40)
F_selector.fit(X, y)

# Extract column indices from feature selector
F_cols = F_selector.get_support(indices=True)

# Get column names of selected columns
F_colnames = list(X.iloc[:, F_cols].columns)
F_colnames

['full_sq',
 'life_sq',
 'max_floor',
 'num_room',
 'kitch_sq',
 'ID_metro',
 'mkad_km',
 'sadovoe_km',
 'bulvar_ring_km',
 'kremlin_km',
 'office_count_500',
 'office_count_1000',
 'cafe_count_1000_price_high',
 'cafe_count_1500_price_high',
 'office_sqm_2000',
 'cafe_sum_2000_min_price_avg',
 'sport_count_2000',
 'office_sqm_3000',
 'cafe_sum_3000_min_price_avg',
 'cafe_sum_3000_max_price_avg',
 'cafe_avg_price_3000',
 'cafe_count_3000_price_high',
 'sport_count_3000',
 'office_count_5000',
 'office_sqm_5000',
 'trc_count_5000',
 'cafe_count_5000',
 'cafe_sum_5000_min_price_avg',
 'cafe_sum_5000_max_price_avg',
 'cafe_avg_price_5000',
 'cafe_count_5000_na_price',
 'cafe_count_5000_price_500',
 'cafe_count_5000_price_1000',
 'cafe_count_5000_price_1500',
 'cafe_count_5000_price_2500',
 'cafe_count_5000_price_4000',
 'cafe_count_5000_price_high',
 'church_count_5000',
 'leisure_count_5000',
 'sport_count_5000']

#### Using Mutual Information

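This selector takes a while to run. It scores each predictor by its estimated mutual information with the response, using entropy estimates based on k-nearest-neighbor distances, and keeps the top 40.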
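
One reason to try mutual information alongside F-statistics is that it can pick up non-linear dependence. The sketch below uses synthetic data (not train.csv) in which the response depends on the square of one feature; the variable names and the noise level are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
n = 500
x_quadratic = rng.uniform(-1, 1, size=n)  # y depends on this feature, but not linearly
x_noise = rng.uniform(-1, 1, size=n)      # unrelated to y
X_toy = np.column_stack([x_quadratic, x_noise])
y_toy = x_quadratic**2 + 0.05 * rng.normal(size=n)

# Mutual information estimated from k-nearest-neighbor distances
mi_scores = mutual_info_regression(X_toy, y_toy, n_neighbors=3, random_state=0)

# Univariate F-statistics for comparison
F_scores, _ = f_regression(X_toy, y_toy)

print("mutual information:", mi_scores)
print("F-statistics:      ", F_scores)
```

In this setup the quadratic feature should receive a much higher mutual information score than the noise feature, while its F-statistic stays small because its linear correlation with the response is close to zero.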

In [8]:
# Define and fit feature selector
mi_selector = SelectKBest(mutual_info_regression, k=40)
mi_selector.fit(X, y)

# Extract column indices from feature selector
mi_cols = mi_selector.get_support(indices=True)

# Get column names of selected columns
mi_colnames = list(X.iloc[:, mi_cols].columns)

mi_colnames

['full_sq',
 'life_sq',
 'build_year',
 'num_room',
 'kitch_sq',
 'preschool_quota',
 'children_school',
 'young_all',
 'young_male',
 'young_female',
 'work_male',
 'work_female',
 'ekder_male',
 '0_6_male',
 '7_14_all',
 '7_14_male',
 '7_14_female',
 '0_17_all',
 '0_17_male',
 '0_17_female',
 '0_13_all',
 '0_13_male',
 'raion_build_count_with_material_info',
 'build_count_panel',
 'raion_build_count_with_builddate_info',
 'build_count_1946-1970',
 'ID_metro',
 'ID_railroad_station_walk',
 'ID_railroad_station_avto',
 'cafe_count_2000',
 'cafe_count_2000_price_2500',
 'trc_sqm_3000',
 'cafe_count_3000',
 'cafe_count_3000_price_1500',
 'cafe_count_3000_price_2500',
 'office_count_5000',
 'office_sqm_5000',
 'cafe_count_5000',
 'cafe_count_5000_price_1000',
 'cafe_count_5000_price_2500']

### Comparing Training Feature Sets

**Evaluating Model with Features from F tests**

In [9]:
# Define the X matrix for this model
X_F = X.iloc[:, F_cols]

# Define the model
F_model = LinearRegression()

# Get 10 R^2 values from a 10-fold cross-validation
F_R_sqs = cross_val_score(
    F_model, X_F, y, cv=10, scoring='r2')

F_R_sq_mean = F_R_sqs.mean()

print("=" * (29 + len(str(F_R_sq_mean))))
print("10-fold Cross-Validated R^2: {}".format(F_R_sq_mean))
print("=" * (29 + len(str(F_R_sq_mean))))

10-fold Cross-Validated R^2: 0.5059986129756255


**Evaluating Model with Features from Mutual Information**

In [10]:
# Define the X matrix for this model
X_mi = X.iloc[:, mi_cols]

# Define the model
mi_model = LinearRegression()

# Get 10 R^2 values from a 10-fold cross-validation
mi_R_sqs = cross_val_score(
    mi_model, X_mi, y, cv=10, scoring='r2')

# The first fold's R^2 is badly off, so it is excluded; this mean covers folds 2-10 only
mi_R_sq_mean = mi_R_sqs[1:].mean()

print("=" * (29 + len(str(mi_R_sq_mean))))
print("10-fold Cross-Validated R^2: {}".format(mi_R_sq_mean))
print("=" * (29 + len(str(mi_R_sq_mean))))

10-fold Cross-Validated R^2: 0.503848832181219
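
When one fold scores very differently from the rest, it can help to look at every fold before averaging. With `cv=10` and a regressor, scikit-learn uses unshuffled `KFold`, so the first fold is simply the first tenth of the rows; if the rows are ordered in some way, that fold may not resemble the others. Whether that explains the first-fold problem here is an assumption, not something the notebook establishes. A minimal sketch on synthetic data, using shuffled folds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (stand-in for X_mi and y)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# Shuffled 10-fold split, so no fold is just a contiguous block of rows
cv = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=cv, scoring="r2")

# Inspect each fold rather than only the mean
for i, score in enumerate(fold_scores, start=1):
    print(f"fold {i:2d}: R^2 = {score:.3f}")
print(f"mean over all {len(fold_scores)} folds: {fold_scores.mean():.3f}")
```

Using the same `cv` object for both feature sets would also keep the two means comparable, since both would then average over the same ten folds.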


### Feature Selection from Macro Data

#### Choosing Features by Eye

+ oil_urals: Crude Oil Urals ($ / bbl)
+ gdp_quart: GDP
+ gdp_quart_growth: Real GDP growth
+ cpi: Inflation - Consumer Price Index Growth
+ usdrub: Ruble/USD exchange rate
+ deposits_rate: Average interest rate on deposits
+ mortgage_value: Volume of mortgage loans
+ income_per_cap: Average income per capita 
+ unemployment: Unemployment rate
+ invest_fixed_capital_per_cap: Investments in fixed capital per capita

#### Using F-Statistics

In [11]:
# Define and fit feature selector
F_selector_m = SelectKBest(f_regression, k=10)
F_selector_m.fit(X_macro, y_macro)

# Extract column indices from feature selector
F_cols_m = F_selector_m.get_support(indices=True)

# Get column names of selected columns
F_colnames_m = list(X_macro.iloc[:, F_cols_m].columns)
F_colnames_m

['fixed_basket',
 'deposits_value',
 'gdp_annual',
 'gdp_annual_growth',
 'usdrub',
 'eurrub',
 'rts',
 'micex_rgbi_tr',
 'cpi',
 'ppi']

#### Using Mutual Information

In [12]:
# Define and fit feature selector
mi_selector_m = SelectKBest(mutual_info_regression, k=10)
mi_selector_m.fit(X_macro, y_macro)

# Extract column indices from feature selector
mi_cols_m = mi_selector_m.get_support(indices=True)

# Get column names of selected columns
mi_colnames_m = list(X_macro.iloc[:, mi_cols_m].columns)
mi_colnames_m

['oil_urals',
 'mortgage_value',
 'deposits_value',
 'average_provision_of_build_contract',
 'cpi',
 'ppi',
 'deposits_growth',
 'balance_trade_growth',
 'gdp_quart',
 'gdp_quart_growth']
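
The three macro candidate lists (by eye, F-statistics, mutual information) can be compared with plain set operations. The lists below are copied from this section; the variable names are hypothetical.

```python
# Macro features chosen by eye
by_eye = {
    "oil_urals", "gdp_quart", "gdp_quart_growth", "cpi", "usdrub",
    "deposits_rate", "mortgage_value", "income_per_cap", "unemployment",
    "invest_fixed_capital_per_cap",
}

# Macro features selected by F-statistics
f_selected = {
    "fixed_basket", "deposits_value", "gdp_annual", "gdp_annual_growth",
    "usdrub", "eurrub", "rts", "micex_rgbi_tr", "cpi", "ppi",
}

# Macro features selected by mutual information
mi_selected = {
    "oil_urals", "mortgage_value", "deposits_value",
    "average_provision_of_build_contract", "cpi", "ppi", "deposits_growth",
    "balance_trade_growth", "gdp_quart", "gdp_quart_growth",
}

# Features picked by both automatic methods
shared_auto = sorted(f_selected & mi_selected)

# Overlap of each automatic method with the by-eye list
eye_and_f = sorted(by_eye & f_selected)
eye_and_mi = sorted(by_eye & mi_selected)

# Features all three approaches agree on
all_three = sorted(by_eye & f_selected & mi_selected)

print("F and MI:      ", shared_auto)
print("by eye and F:  ", eye_and_f)
print("by eye and MI: ", eye_and_mi)
print("all three:     ", all_three)
```

Features that several methods agree on are easier to justify under the requirement to explain every chosen variable.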

### Comparing Macro Feature Sets

**Evaluating Model with Features from F tests**

In [13]:
# Define the X matrix for this model
X_F_m = X_macro[F_colnames_m]

# Define the model
F_model_m = LinearRegression()

# Get 10 R^2 values from a 10-fold cross-validation
F_R_sqs_m = cross_val_score(
    F_model_m, X_F_m, y_macro, cv=10, scoring='r2')

F_R_sq_mean_m = F_R_sqs_m.mean()

print("=" * (29 + len(str(F_R_sq_mean_m))))
print("10-fold Cross-Validated R^2: {}".format(F_R_sq_mean_m))
print("=" * (29 + len(str(F_R_sq_mean_m))))

10-fold Cross-Validated R^2: -0.0093914444244658


**Evaluating Model with Features from Mutual Information**

In [14]:
# Define the X matrix for this model
X_mi_m = X_macro[mi_colnames_m]

# Define the model
mi_model_m = LinearRegression()

# Get 10 R^2 values from a 10-fold cross-validation
mi_R_sqs_m = cross_val_score(
    mi_model_m, X_mi_m, y_macro, cv=10, scoring='r2')

mi_R_sq_mean_m = mi_R_sqs_m.mean() # mean over all 10 folds

print("=" * (29 + len(str(mi_R_sq_mean_m))))
print("10-fold Cross-Validated R^2: {}".format(mi_R_sq_mean_m))
print("=" * (29 + len(str(mi_R_sq_mean_m))))

10-fold Cross-Validated R^2: -0.008642195862967906
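
Both macro-only models have a slightly negative cross-validated R^2. As a reminder of what that means: R^2 = 1 − SS_res / SS_tot, so a model that always predicts the mean of the response scores 0, and anything that predicts worse than the mean scores below 0. A small worked example with made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

# Baseline: always predict the mean of y_true (2.0)
mean_pred = np.full_like(y_true, y_true.mean())

# Predictions that are worse than the mean baseline
bad_pred = np.array([3.0, 2.0, 1.0])

# SS_tot = (1-2)^2 + (2-2)^2 + (3-2)^2 = 2
# Baseline: SS_res = 2, so R^2 = 1 - 2/2 = 0
r2_mean = r2_score(y_true, mean_pred)

# Bad predictions: SS_res = (1-3)^2 + 0 + (3-1)^2 = 8, so R^2 = 1 - 8/2 = -3
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean, r2_bad)
```

Read this way, values near −0.009 suggest that, on held-out folds, these daily macro variables alone predict price_doc about as well as the mean price does.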
