# Long Notebook - Active

#### Emil B. Berglund - 529222 & Louis H. H. Linnerud - 539305, Team: Noe Lættis 

#### Table of contents:
1. Exploratory data analysis
2. Feature Engineering 
3. Models/Predictors
    - LightGBM
    - Random Forest Regressor
4. Model Interpretations
    - feature importance
5. Improved models (possibly)



# ___________ _0. Setup_ ___________

In [None]:
import numpy as np
import math as mt
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sklearn.metrics as metrics
import sklearn.ensemble as ensemble
import optuna
import lightgbm as lgb
import catboost as cb
import featuretools as ft
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold, cross_val_score
from sklearn.metrics import log_loss, mean_squared_error, mean_squared_log_error
from sklearn.preprocessing import LabelEncoder
from verstack import LGBMTuner, MeanTargetEncoder, OneHotEncoder


#from pandas_profiling import ProfileReport

In [None]:
def writeResultToFile(test_data, pred_data, nameOfFile='namelessSubmission'):
    submission = pd.DataFrame()
    submission['id'] = test_data['store_id']
    submission['predicted'] = np.asarray(pred_data)
    submission.to_csv('submissionFiles/'+ nameOfFile+'.csv', index=False)
    

In [None]:
def rmsle(y_true, y_pred):
    return metrics.mean_squared_log_error(y_true, y_pred)**0.5
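
As a sanity check on the metric, RMSLE can be written out by hand: it is the root mean squared error computed on `log1p`-transformed values. The sketch below uses only the standard library; `rmsle_plain` is a hypothetical helper name and the numbers are made up.

```python
import math

def rmsle_plain(y_true, y_pred):
    """RMSLE written out by hand: RMSE between log1p(y_pred) and log1p(y_true)."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Toy values (made up for illustration)
perfect = rmsle_plain([1.0, 10.0, 100.0], [1.0, 10.0, 100.0])
off_by_one_log_unit = rmsle_plain([0.0], [math.e - 1.0])
```

Because of this identity, a model trained with an RMSE loss on `log1p(revenue)` is optimizing the RMSLE used in this notebook.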

# ___________ _1. Exploratory Data Analysis_ ___________


### EDA Notes
- [x] Search domain knowledge
- [x] Check if the data is intuitive
- [ ] Understand how the data was generated
- [x] Explore individual features
    - [x] Agencies
    - [x] stores with 0 revenue
    - [x] food and drink stores and grocery stores
- [x] Explore pairs and groups
    - [x] Store type vs revenue
    - [x] Geo position of stores in train and test set
    - [x] Revenue based on geo position
- [x] Clean up features
    - [x] remove 2016
    - [x] remove outliers
    - [x] remove 0 revenue rows


#### Domain Knowledge

Retailers earn their revenue from sales, and different retailers sell different products to different customers. Products differ in margin and sales volume, which directly affects revenue. The number of sales most likely correlates strongly with the number of customers, and areas with a high population density will therefore most likely have more customers, more sales and, in turn, higher revenue. Retailer type and geographical position are therefore likely to have a large impact on revenue. These two attributes alone can be a good indicator, but they are not necessarily sufficient, as described in this article: https://carto.com/blog/retail-revenue-prediction-data-science/. Area infrastructure, retailer reputation, market competition, inventory management, customer type, sales strategy and many other factors also affect revenue, which makes the problem complex. Further reading on some of these factors: https://smallbusiness.chron.com/calculate-percentage-profit-markups-business-60099.html


#### Is the data intuitive?

As can be seen below, the data is organized in rows, where each row represents a single retailer with its relevant attributes and revenue. The stores train, extra and test data are intuitive.

The different grunnkrets datasets were less intuitive before some exploration: the same grunnkrets_id appeared more than once. We soon realized that this is because each measurement (for example, average income) is recorded twice, once in 2015 and once in 2016.

In [None]:
stores_train = pd.read_csv('data/stores_train.csv')



In [None]:
stores_train.head()

In [None]:
stores_train.info()

In [None]:
stores_train.describe()

In [None]:
#report = ProfileReport(stores_train)
#report

#### Explore individual features and pairs and groups

Explore revenue based on store type

In [None]:
len(stores_train.plaace_hierarchy_id.unique())

In [None]:
plaace_hierarchy = pd.read_csv('data/plaace_hierarchy.csv')
stores_with_hierarchy = stores_train.merge(plaace_hierarchy, how='left', on='plaace_hierarchy_id')

In [None]:
plt.figure(figsize=(20,6))
plt.gcf().set_dpi(600)
plt.xticks(rotation=90)
sns.violinplot(x='lv2_desc',y='revenue',data=stores_with_hierarchy).set_title("Revenue on store type")
plt.show()

Further exploration of the "Agencies" store type

In [None]:
stores_with_hierarchy[stores_with_hierarchy["lv2_desc"]=="Agencies"]

Further exploration of "Food and drink" type stores


In [None]:
plt.figure(figsize=(10,5))
sns.violinplot(x='lv3_desc',y='revenue',data=stores_with_hierarchy[stores_with_hierarchy["lv2_desc"]=="Food and drinks"]).set_title("Food and drinks violin plot")
plt.xticks(rotation=90)
plt.show()

Explore retailers with NaN, 0 or negative revenue

In [None]:
stores_with_hierarchy[stores_with_hierarchy["revenue"]==0.0].describe()

In [None]:
stores_with_hierarchy[stores_with_hierarchy["revenue"] < 0.0].describe()

In [None]:
stores_with_hierarchy[stores_with_hierarchy["revenue"].isna()].describe()
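
A note on the NaN check: NaN compares unequal to every value, itself included, so missing revenues are found with `isna()` rather than with an equality test. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy revenue column with one missing value (values are made up)
revenue = pd.Series([0.0, 12.5, np.nan, 3.0])

matches_by_equality = int((revenue == np.nan).sum())  # equality never matches NaN
matches_by_isna = int(revenue.isna().sum())           # counts the missing value
```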

All retailers and their corresponding revenue. The plot is a visual check for outliers; there are clearly some, as can be seen in the long tail to the right of the main distribution.
The data is clearly positively skewed, which is confirmed by the skew values printed above the plot.

In [None]:
rev_log = pd.DataFrame()
rev_log['revenue_log'] = np.log1p(stores_train['revenue'])

fig, (ax1, ax2) = plt.subplots(figsize=(15, 5), ncols=2, dpi=100)
sns.distplot(stores_train['revenue'], ax=ax1);
ax1.set_title('Distribution revenue');
sns.distplot(rev_log['revenue_log'], ax=ax2);
ax2.set_title('Distribution of revenue after log transform');

print(f"raw data skew: {stores_train['revenue'].skew()}")
print(f"log transform skew: {rev_log['revenue_log'].skew()}")

### Cleaning

#### Remove columns function - example: year is a const value and has no effect on the end result

In [None]:
def remove_columns(dataSet, columns):
    for column in columns:
        dataSet.drop(column, axis=1, inplace=True)


In [None]:
remove_columns(stores_train,['year'])
stores_train.head()

#### Remove retailers with 0 revenue function - might be handy

In [None]:
def remove_retailers_with_0_revenue(dataSet):
    dataSet.drop(dataSet[dataSet['revenue']==0.0].index, inplace=True)

#### Removing outliers

Plotting all retailers by store type before and after trimming to confirm that the outliers have actually been removed

Below is before trimming

In [None]:
for store_type in stores_with_hierarchy['lv2_desc'].unique():
    plt.figure(figsize=(12,2))
    sns.violinplot(x='lv3_desc',y='revenue',data=stores_with_hierarchy[stores_with_hierarchy["lv2_desc"]==store_type]).set_title(f"{store_type} violin plot")
    plt.show()
    break #comment out for exploring more store types

Cap-outliers-function for the relationship between store type and revenue

In [None]:
def quantile_storeType_vs_revenue(stores, lower, upper):
    # Cap revenue at the lower/upper quantile within each store type (modifies stores in place)
    for store_type in stores['plaace_hierarchy_id'].unique():
        is_type = stores['plaace_hierarchy_id']==store_type
        data = stores[is_type]
        upper_threshold = data['revenue'].quantile(upper)
        lower_threshold = data['revenue'].quantile(lower)
        # Alternative: drop the rows outside the thresholds instead of capping them

        # Boolean masks with .loc; .iloc with index labels only works for a default 0..n-1 index
        stores.loc[is_type & (stores['revenue']>upper_threshold), 'revenue'] = upper_threshold
        stores.loc[is_type & (stores['revenue']<lower_threshold), 'revenue'] = lower_threshold


In [None]:
stores_train = pd.read_csv('data/stores_train.csv')
quantile_storeType_vs_revenue(stores_train,0.05,0.86)
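
Per-store-type capping can also be sketched with `groupby` and `clip`, which avoids explicit row indexing. This is an illustrative alternative on made-up data, not the notebook's function; `cap_by_group` and the toy values are assumptions.

```python
import pandas as pd

def cap_by_group(df, group_col, value_col, lower, upper):
    """Return value_col clipped to its [lower, upper] quantiles within each group."""
    return df.groupby(group_col)[value_col].transform(
        lambda s: s.clip(lower=s.quantile(lower), upper=s.quantile(upper))
    )

# Toy data: group A has one extreme revenue value
toy = pd.DataFrame({
    'plaace_hierarchy_id': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'revenue': [1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 20.0, 30.0],
})
toy['revenue_capped'] = cap_by_group(toy, 'plaace_hierarchy_id', 'revenue', 0.0, 0.75)
```

Unlike the in-place function, this variant returns a new column, so the raw revenue stays available for comparison.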

Plot after capping outliers

The plot below shows that the outliers have been capped.


In [None]:
plaace_hierarchy = pd.read_csv('data/plaace_hierarchy.csv')
stores_with_hierarchy = stores_train.merge(plaace_hierarchy, how='left', on='plaace_hierarchy_id')
for store_type in stores_with_hierarchy['lv2_desc'].unique():
    plt.figure(figsize=(12,2))
    sns.violinplot(x='lv3_desc',y='revenue',data=stores_with_hierarchy[stores_with_hierarchy["lv2_desc"]==store_type]).set_title(f"{store_type} violin plot")
    plt.show()
    break #comment out for exploring more store types

In [None]:
stores_train = pd.read_csv('data/stores_train.csv')
rev_log = pd.DataFrame()
rev_log['revenue_log'] = np.log1p(stores_train['revenue'])

rev_capped_log = stores_train.copy()
quantile_storeType_vs_revenue(rev_capped_log, 0.00, 0.90)
rev_capped_log['revenue'] = np.log1p(rev_capped_log['revenue'])

fig, (ax1, ax2) = plt.subplots(figsize=(15, 5), ncols=2, dpi=100)
sns.distplot(rev_log['revenue_log'], ax=ax1, bins=91);
ax1.set_title('Distribution log transformed revenue');
sns.distplot(rev_capped_log['revenue'], ax=ax2, bins=91);
ax2.set_title('Distribution capped log transform revenue');

Quantile capping improved our predictions considerably at the beginning of the project, but it had no noticeable impact once we began using the log transform. The plot above shows that capping does not improve the distribution when combined with the log transform.

#### comparing test set to training set

In [None]:
stores_train = pd.read_csv('data/stores_train.csv')
stores_test = pd.read_csv('data/stores_test.csv')

comparing coordinates

In [None]:
plt.figure(figsize=(16,9), dpi=600)
plt.scatter(stores_train['lon'],stores_train['lat'], label="training",color='red')
plt.scatter(stores_test['lon'], stores_test['lat'], alpha=0.2, label="test", color="blue")
plt.legend(fontsize=10,ncol=2)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.grid()
plt.show()



In [None]:
fig = plt.figure(dpi=200)
ax1 = fig.add_subplot(projection='3d')
ax1.scatter(stores_train['lon'],stores_train['lat'],stores_train['revenue'])
ax1.set_xlabel('Lon')
ax1.set_ylabel('Lat')
ax1.set_zlabel('Revenue')
plt.show()


#### Examine whether a store occurs in multiple datasets 

In [None]:
def stores_that_are_in_both_sets(df1, df2):
    
    duplicate_set = pd.merge(df1,df2, how='inner', on='store_name')
    return duplicate_set

stores_train = pd.read_csv('data/stores_train.csv')
stores_test = pd.read_csv('data/stores_test.csv')
stores_extra = pd.read_csv('data/stores_extra.csv')

dup = stores_that_are_in_both_sets(stores_test, stores_train)
dup.head()

### Explore the other data sets

In [None]:
buss_stopps = pd.read_csv('data/busstops_norway.csv')
buss_stopps.head()

In [None]:
grunnkrets = pd.read_csv('data/grunnkrets_norway_stripped.csv')
grunnkrets.head()

In [None]:
gk_incomes = pd.read_csv('data/grunnkrets_income_households.csv')
gk_incomes.head()

In [None]:
gk_households = pd.read_csv('data/grunnkrets_households_num_persons.csv')
gk_households.head()

In [None]:
gk_ages = pd.read_csv('data/grunnkrets_age_distribution.csv')
gk_ages.head()

### Imbalance??

In [None]:
for col in stores_train:
    print(stores_train[col].value_counts())

# ___________ _2. Feature Engineering_ ___________

In [None]:
stores_train = pd.read_csv('data/stores_train.csv')
grunnkrets = pd.read_csv('data/grunnkrets_norway_stripped.csv')
gk_incomes = pd.read_csv('data/grunnkrets_income_households.csv')
gk_households = pd.read_csv('data/grunnkrets_households_num_persons.csv')
gk_ages = pd.read_csv('data/grunnkrets_age_distribution.csv')
buss_stopps = pd.read_csv('data/busstops_norway.csv')

grunnkrets.drop_duplicates(subset=['grunnkrets_id'], inplace=True)
gk_incomes.drop_duplicates(subset=['grunnkrets_id'], inplace=True)
gk_households.drop_duplicates(subset=['grunnkrets_id'], inplace=True)
gk_ages.drop_duplicates(subset=['grunnkrets_id'], inplace=True)
buss_stopps.drop_duplicates(subset=['busstop_id'], inplace=True)
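
`drop_duplicates` keeps the first row per `grunnkrets_id`, so which measurement year survives depends on the row order in the file. If the most recent measurement is wanted, one option is to sort by year first. The sketch assumes the tables carry a `year` column, which the 2015/2016 observation in the EDA suggests but which is not verified here; the data is made up.

```python
import pandas as pd

# Toy table: each grunnkrets_id measured in 2015 and 2016 (values are made up)
toy_incomes = pd.DataFrame({
    'grunnkrets_id': [101, 102, 101, 102],
    'year': [2016, 2015, 2015, 2016],
    'all_households': [520000, 410000, 500000, 430000],
})

# Sort so the newest year comes last, then keep the last row per id
latest = (
    toy_incomes.sort_values('year')
    .drop_duplicates(subset=['grunnkrets_id'], keep='last')
    .sort_values('grunnkrets_id')
    .reset_index(drop=True)
)
```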



Ages

In [None]:
# number of people in grunnkrets
gk_ages['tot_people'] = np.sum(gk_ages.iloc[:,np.arange(2,93,1)], axis=1)
gk_ages.head()

# people density (number of people divided by area_km2)
gk_area = grunnkrets[['grunnkrets_id','area_km2']]
gk_ages = pd.merge(gk_ages, gk_area, how='left', on='grunnkrets_id')
gk_ages['people_density'] = (gk_ages['tot_people'] / gk_ages['area_km2'])
gk_ages['people_density_log'] = np.log1p(gk_ages['tot_people'] / gk_ages['area_km2'])

gk_ages.head()



Households

In [None]:
# Number of households
gk_households['nb_households']  = np.sum(gk_households.iloc[:,np.arange(2,10,1)], axis=1)
gk_households.head()




In [None]:
stores_train.head()



Bus stops

In [None]:
# Lat/lon extraction from the 'geometry' column,
# e.g. "POINT(10.7781327278563 59.9299988828761)" (longitude first, then latitude)

# Only works with buss_stopps
def addLatAndLong(buss):
    # Strip the leading "POINT(" and the trailing ")", then split on whitespace
    coords = buss['geometry'].str[6:].str.replace(')', '', regex=False).str.split(expand=True)
    buss['lon'] = coords[0].astype(float)
    buss['lat'] = coords[1].astype(float)

# The later cells read buss_stopps['lon'] and buss_stopps['lat'], so the call must be active
addLatAndLong(buss_stopps)

buss_stopps.tail()

In [None]:
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance in kilometers between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(mt.radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = mt.sin(dlat/2)**2 + mt.cos(lat1) * mt.cos(lat2) * mt.sin(dlon/2)**2
    c = 2 * mt.asin(mt.sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles. Determines return value units.
    return c * r
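
For the bus-stop features, the distance from one store to every stop is needed, and calling the scalar function in a Python loop is slow. A vectorized variant with NumPy is sketched below; `haversine_np` is a hypothetical name and the coordinates are made up.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; inputs in decimal degrees, scalars or arrays."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One store against three stops: same point, one degree north, one degree east
stop_lon = np.array([10.0, 10.0, 11.0])
stop_lat = np.array([60.0, 61.0, 60.0])
distances_km = haversine_np(10.0, 60.0, stop_lon, stop_lat)
```

At 60 degrees north, one degree of longitude covers roughly half the distance of one degree of latitude, which is why a threshold expressed in degrees does not correspond to a fixed distance in km.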

In [None]:
buss_stopps.head()

In [None]:

# Number of bus stops within x km
# Distance to the nearest bus stop (not computed in this version)

def distBetweenStoreAndBuss(stores, buss):
    """Add 'busStopsWithin10km': the number of bus stops within 10 km of each store."""
    # Bus stop coordinates in radians, converted once
    lonBuss = np.radians(buss['lon'].to_numpy())
    latBuss = np.radians(buss['lat'].to_numpy())

    tens = []
    for _, row in stores.iterrows():
        lonStore = np.radians(float(row['lon']))
        latStore = np.radians(float(row['lat']))

        # Haversine distance (km) from this store to every bus stop at once
        a = np.sin((latBuss - latStore) / 2)**2 + np.cos(latStore) * np.cos(latBuss) * np.sin((lonBuss - lonStore) / 2)**2
        dist = 2 * 6371 * np.arcsin(np.sqrt(a))

        # Count per store; the counter must not be reset for every bus stop
        tens.append(int(np.sum(dist < 10)))

    stores['busStopsWithin10km'] = np.array(tens)

distBetweenStoreAndBuss(stores_train, buss_stopps)




In [None]:
def dist(X, Y):
    # Euclidean distance in degrees (not km) between one point X and every row of Y
    sx = np.sum(X**2, keepdims=True)
    sy = np.sum(Y**2, axis=1, keepdims=True)
    # Clip at 0 so rounding errors cannot put a negative value under the square root
    return np.sqrt(np.maximum(-2 * X.dot(Y.T) + sx + sy.T, 0))

busMatrix = buss_stopps[['lat', 'lon']].to_numpy()
storeMatrix = stores_train[['lat', 'lon']].to_numpy()

for row in storeMatrix:
    result = dist(row, busMatrix).flatten()
    # Distances to the bus stops closer than 0.0015 degrees
    print(result[result<0.0015])
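
Counting stops within a radius and finding the nearest stop can also be done with a spatial index instead of comparing every store with every stop. The sketch below uses scikit-learn's `BallTree` with the haversine metric on made-up coordinates; it is one possible approach, not what the notebook ran.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

# Made-up coordinates as [lat, lon] in degrees
stops_deg = np.array([[60.00, 10.00], [60.01, 10.00], [61.00, 10.00]])
stores_deg = np.array([[60.00, 10.00], [62.00, 10.00]])

# The haversine metric expects [lat, lon] in radians and returns radians
tree = BallTree(np.radians(stops_deg), metric='haversine')
stores_rad = np.radians(stores_deg)

# Number of stops within 10 km of each store
stops_within_10km = tree.query_radius(stores_rad, r=10 / EARTH_RADIUS_KM, count_only=True)

# Distance to the nearest stop, converted from radians to km
nearest_rad, _ = tree.query(stores_rad, k=1)
nearest_km = nearest_rad[:, 0] * EARTH_RADIUS_KM
```

The radius is divided by the Earth radius because the tree works on the unit sphere; the same tree can be reused for the 2 km and 5 km counts.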






Train set

In [None]:
tikmfil = pd.DataFrame()
# The bus stop counts were computed for stores_train, so the ids come from there as well
tikmfil['id'] = stores_train['store_id']
tikmfil['busStopsWithin10km'] = stores_train['busStopsWithin10km']
tikmfil.to_csv('submissionFiles/BusstoppTikmFil.csv', index=False)
tikmfil

In [None]:
def self_aggregate_columns (stores):
    # has mall

    # has chain

    # Distance to another store

    # Distance to another store of same type

    # Density of stores in grunnkrets
    #gk_area = grunnkrets[['grunnkrets_id','area_km2']]
    #st_dens = pd.merge(gk_ages, gk_area, how='left', on='grunnkrets_id')

    # lat lon log transform
    stores['lat_log'] = np.log1p(stores['lat'])
    stores['lon_log'] = np.log1p(stores['lon'])

Concat columns

In [None]:
def add_selected_columns(df):
    self_aggregate_columns(df)
    gk = grunnkrets[['grunnkrets_id','municipality_name']]
    gk_i = gk_incomes[['grunnkrets_id','all_households']] #all house holds = median income
    gk_h = gk_households[['grunnkrets_id','nb_households']]
    gk_a = gk_ages[['grunnkrets_id','tot_people','people_density_log']]
    
    concat = pd.merge(df, gk, how='left', on='grunnkrets_id')
    concat = pd.merge(concat, gk_i, how='left', on='grunnkrets_id')
    concat = pd.merge(concat, gk_h, how='left', on='grunnkrets_id')
    concat = pd.merge(concat, gk_a, how='left', on='grunnkrets_id')
    
    return concat
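
Left merges silently multiply rows when the right-hand table has repeated keys, which is why the grunnkrets tables are deduplicated before merging. pandas can enforce this with the `validate` argument of `merge`. A sketch on made-up data:

```python
import pandas as pd

stores = pd.DataFrame({'store_id': ['s1', 's2'], 'grunnkrets_id': [101, 102]})
ages_with_duplicates = pd.DataFrame({
    'grunnkrets_id': [101, 101, 102],  # 101 appears twice, e.g. once per year
    'tot_people': [300, 310, 150],
})

# Without validation the merge succeeds but returns three rows for two stores
unchecked = pd.merge(stores, ages_with_duplicates, how='left', on='grunnkrets_id')

# With validation pandas raises instead of duplicating rows
try:
    pd.merge(stores, ages_with_duplicates, how='left', on='grunnkrets_id',
             validate='many_to_one')
    merge_was_rejected = False
except pd.errors.MergeError:
    merge_was_rejected = True
```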

In [None]:
stores_train = pd.read_csv('data/stores_train.csv')
remove_columns(stores_train, ['store_id','store_name','year','address','sales_channel_name'])
stores_train = add_selected_columns(stores_train)
sns.heatmap(stores_train.corr(numeric_only=True), annot=True) # correlations are only defined for numeric columns
plt.show()

# ___________ _3. Machine Learning Models and Predictions_ ___________


### Helper functions

In [None]:
def convert_DType_LGBM(dFrame):
    le = LabelEncoder()
    X = pd.DataFrame()
    
    for col_name in dFrame:
        if dFrame[col_name].dtypes == 'object':
            X[col_name] = dFrame[col_name].astype('category')
            
        #elif col_name == 'grunnkrets_id':
        #    X[col_name] = le.fit_transform(dFrame[col_name])
        
        else:
            X[col_name] = dFrame[col_name]
    
    return X

In [None]:
def convert_DType_CatBoost(dFrame):
    X = pd.DataFrame()
    for col_name in dFrame:
        
        if col_name == 'grunnkrets_id':# or col_name == 'plaace_hierarchy_id':
            # Cast to string so the same id is the same category in train and test;
            # a LabelEncoder fitted separately on each set would assign different codes
            X[col_name] = dFrame[col_name].astype(str)
        
        elif dFrame[col_name].dtypes == 'object':
            X[col_name] = dFrame[col_name].astype(str)
            
        else:
            X[col_name] = dFrame[col_name]
    
    return X

Load train data and divide into test and train

In [None]:
def get_data(test_size=0.20):
    stores_train = pd.read_csv('data/stores_train.csv')
    stores_test = pd.read_csv('data/stores_test.csv')

    # select prefered columns
    remove_columns(stores_train, ['store_id','store_name','year','address','sales_channel_name'])
    remove_columns( stores_test, ['store_id','store_name','year','address','sales_channel_name'])

    # Add features
    #print(stores_train.shape) 
    stores_train = add_selected_columns(stores_train)
    stores_test = add_selected_columns(stores_test)
    #print(stores_train.shape)

    # Preprocess/Clean data
    #quantile_storeType_vs_revenue(stores_train,0.01, 0.88)

    # Divide data into train and test set
    x_train = stores_train.drop('revenue', axis=1)

    y_train = stores_train['revenue']
    y_train=np.log1p(y_train) #log transform revenue

    x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=test_size, random_state=3)
    
    return  x_train, x_test, y_train, y_test, stores_test


In [None]:
x_train_check, _, _, _, _ = get_data()
x_train_check.dtypes

#### LightGBM


In [None]:
# Load data
LGBM_x_train, LGBM_x_test, y_train, y_test, _ = get_data()

# Convert to approperiate dtypes
LGBM_x_train = convert_DType_LGBM(LGBM_x_train)
LGBM_x_test = convert_DType_LGBM(LGBM_x_test)



# Make model, fit and predict
parameters = {# Params obtained trough testing and reading up on this guide: https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5
              #'metric': 'acc',
              #'n_estimators' : 400
              #'path_smooth' : 0.5,
              #'min_data_in_leaf' : 3
}

LGBM_model = LGBMRegressor(**parameters)
LGBM_model.fit(LGBM_x_train, y_train)
LGBM_pred = LGBM_model.predict(LGBM_x_test)
LGBM_pred=np.expm1(LGBM_pred) #invert log transform

# Run some tests
number_of_negatives = 0
for i in range(len(LGBM_pred)):
    if LGBM_pred[i] < 0.0:
        number_of_negatives += 1
        LGBM_pred[i] = 0.0

print(f"number of negatives: {number_of_negatives}")
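# y_test is log-transformed, so invert it before comparing with the back-transformed predictions
print(f"rmsle: {rmsle(np.expm1(y_test),LGBM_pred)}")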
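
Since the model is trained on `log1p(revenue)`, predictions and held-out targets have to be on the same scale when they are scored. The sketch below, on made-up numbers, shows how scoring back-transformed predictions against log-scale targets inflates the error; `rmsle_np` is a hypothetical helper.

```python
import numpy as np

def rmsle_np(y_true, y_pred):
    """RMSLE on raw (untransformed) values."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

revenue_true = np.array([10.0, 100.0, 1000.0])   # raw scale (made up)
y_log = np.log1p(revenue_true)                   # what the model is trained on
pred_log = y_log + 0.1                           # pretend model output, off by 0.1
pred_raw = np.clip(np.expm1(pred_log), 0, None)  # back to raw scale, no negatives

score_same_scale = rmsle_np(np.expm1(y_log), pred_raw)  # both on the raw scale
score_mixed_scale = rmsle_np(y_log, pred_raw)           # log-scale target vs raw prediction
```

`np.clip` replaces the explicit loop over negative predictions in one call.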

In [None]:
# Previous RMSLE scores and the resulting Kaggle scores:
# - 0.9055645241057166 RMSLE resulted in 0.71576 on Kaggle (LGBM, a lot of columns)

#### Catboost

In [None]:
# Load data
CB_x_train, CB_x_test, y_train, y_test, _ = get_data()

# Convert to appropriate dtypes
CB_x_train = convert_DType_CatBoost(CB_x_train)
CB_x_test = convert_DType_CatBoost(CB_x_test)
# np.float was removed from NumPy; the builtin float is equivalent here
categorical_features_indices = np.where((CB_x_train.dtypes != float))[0]

print(CB_x_train.dtypes)

# Make model, fit and predict
parameters = {
    #some param
}

CB_model = cb.CatBoostRegressor(loss_function='RMSE', **parameters)
CB_model.fit(CB_x_train,y_train, cat_features=categorical_features_indices)
CB_pred = CB_model.predict(CB_x_test)
CB_pred = np.expm1(CB_pred)


# Run some tests
number_of_negatives = 0
for i in range(len(CB_pred)):
    if CB_pred[i] < 0.0:
        number_of_negatives += 1
        CB_pred[i] = 0.0

print(f"number of negatives: {number_of_negatives}")
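# y_test is log-transformed, so invert it before comparing with the back-transformed predictions
print(f"rmsle: {rmsle(np.expm1(y_test),CB_pred)}")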


Compare models

In [None]:
y_true = np.expm1(y_test) # back to the original revenue scale, like the predictions
compare_preds = pd.DataFrame()
compare_preds['true'] = y_true
compare_preds['mean'] = (CB_pred+LGBM_pred)/2
compare_preds['catboost'] = CB_pred
compare_preds['CB err'] = np.abs(CB_pred - y_true)
compare_preds['lightgbm'] = LGBM_pred
compare_preds['LGBM err'] = np.abs(LGBM_pred - y_true)

print(f" CB err sum: {np.sum(compare_preds['CB err'])}")
print(f" CB err mean: {np.mean(compare_preds['CB err'])}")
print(f" LGBM err sum: {np.sum(compare_preds['LGBM err'])}")
print(f" LGBM err mean: {np.mean(compare_preds['LGBM err'])}")


#compare_preds.head(50)

## Predict test and submit

In [None]:
# Load data
x_train, _, y_train, _, test = get_data(test_size=0.01)

# Convert to appropriate dtypes
LGBM_x_train = convert_DType_LGBM(x_train)
LGBM_test = convert_DType_LGBM(test)

CB_x_train = convert_DType_CatBoost(x_train)
CB_test = convert_DType_CatBoost(test)
categorical_features_indices = np.where((CB_x_train.dtypes != float))[0]

# LGBM
LGBM_model = LGBMRegressor(**parameters)
LGBM_model.fit(LGBM_x_train, y_train)
LGBM_pred = LGBM_model.predict(LGBM_test)
LGBM_pred=np.expm1(LGBM_pred) #invert log transform

# Catboost
CB_model = cb.CatBoostRegressor(loss_function='RMSE', **parameters, silent=True)
CB_model.fit(CB_x_train,y_train, cat_features=categorical_features_indices)
CB_pred = CB_model.predict(CB_test)
CB_pred = np.expm1(CB_pred)

# Aggregate result
PREDICTION = LGBM_pred

In [None]:

#write the prediction to file
writeResultToFile(stores_test, PREDICTION, "LGMB")

# Verify format of submission file
submissionVery = pd.read_csv('submissionFiles/LGMB.csv')
submissionVery.info()

## Old stuff

In [None]:
stores_train = pd.read_csv('data/stores_train.csv')
stores_test = pd.read_csv('data/stores_test.csv')
test = stores_test.copy()

# select prefered columns
remove_columns(stores_train, ['store_id','store_name','year','address','sales_channel_name'])
remove_columns(test, ['store_id','store_name','year','address','sales_channel_name'])

# Add features
print(stores_train.shape, test.shape)
stores_train = add_selected_columns(stores_train)
test = add_selected_columns(test)
print(stores_train.shape, test.shape)

# Preprocess/Clean data
#quantile_storeType_vs_revenue(stores_train,0.01, 0.88)

# Divide data into train and test set
x_train = stores_train.drop('revenue', axis=1)
y_train = stores_train['revenue']
y_train=np.log1p(y_train)



print(x_train.shape, test.shape)

In [None]:
test.head()

In [None]:
lgbm_x_train = convert_DType_LGBM(x_train)
lgbm_test = convert_DType_LGBM(test)
# Model, fit and predict
LGBM_model =LGBMRegressor(**parameters)
LGBM_model.fit(lgbm_x_train, y_train)
submission_pred = LGBM_model.predict(lgbm_test)
submission_pred=np.expm1(submission_pred)
# remove negative values
number_of_negatives = 0
for i in range(len(submission_pred)):
    if submission_pred[i] < 0.0:
        number_of_negatives += 1
        submission_pred[i] = 0.0
print(f"number of negatives: {number_of_negatives}")

In [None]:
#write the prediction to file
writeResultToFile(stores_test, submission_pred, "LGBM_plaace_hierarchy_id_grunnkrets_id_lat_lon_chain_name_mall_name_lat_log_lon_log_municipality_name_all_households_nb_households_tot_people_people_density_log")

# Verify format of submission file
submissionVery = pd.read_csv('submissionFiles/LGBM_plaace_hierarchy_id_grunnkrets_id_lat_lon_chain_name_mall_name_lat_log_lon_log_municipality_name_all_households_nb_households_tot_people_people_density_log.csv')
submissionVery.info()

CatBoost prediction and submission

In [None]:
stores_train = pd.read_csv('data/stores_train.csv')
stores_test = pd.read_csv('data/stores_test.csv')
test = stores_test.copy()

# select prefered columns
remove_columns(stores_train, ['store_id','store_name','year','address','sales_channel_name'])
remove_columns(test, ['store_id','store_name','year','address','sales_channel_name'])

# Add features
print(stores_train.shape, test.shape)
stores_train = add_selected_columns(stores_train)
test = add_selected_columns(test)
print(stores_train.shape, test.shape)

# Preprocess/Clean data
#quantile_storeType_vs_revenue(stores_train,0.01, 0.88)

# Divide data into train and test set
x_train = stores_train.drop('revenue', axis=1)
y_train = stores_train['revenue']
y_train=np.log1p(y_train)

In [None]:
x_train.head()

In [None]:
cb_x_train = convert_DType_CatBoost(x_train)
cb_x_test = convert_DType_CatBoost(test)
# Determine the categorical columns after the dtype conversion
categorical_features_indices = np.where((cb_x_train.dtypes != float))[0]

CB_model = cb.CatBoostRegressor(loss_function='RMSE')
# Candidate hyperparameter grid (defined but not used in the fit below)
grid = {'iterations': [100, 150, 200],
        'learning_rate': [0.03, 0.1],
        'depth': [2, 4, 6, 8],
        'l2_leaf_reg': [0.2, 0.5, 1, 3]}

CB_model.fit(cb_x_train,y_train, cat_features=categorical_features_indices)
CB_pred = CB_model.predict(cb_x_test)
CB_pred = np.expm1(CB_pred) # invert log transform, since y_train is log1p(revenue)
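
The `grid` dictionary in this cell is defined but never passed to a search routine. As a rough sketch of what a search would iterate over, the candidate combinations can be enumerated with the standard library. CatBoost models also provide a `grid_search` method, which is not shown here.

```python
from itertools import product

grid = {'iterations': [100, 150, 200],
        'learning_rate': [0.03, 0.1],
        'depth': [2, 4, 6, 8],
        'l2_leaf_reg': [0.2, 0.5, 1, 3]}

# One dict per combination of hyperparameter values
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]
```

Each candidate would be fitted and scored on a validation split, so the size of the grid directly determines the number of model fits.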



In [None]:
#write the prediction to file
writeResultToFile(stores_test, CB_pred, "Cat_first")

# Verify format of submission file
submissionVery = pd.read_csv('submissionFiles/Cat_first.csv')
submissionVery.info()

## _____ Random Forest Regressor _____

Load, preprocess and convert data to correct format

In [None]:
# Load training and test data
stores_train = pd.read_csv('data/stores_train.csv')
stores_test = pd.read_csv('data/stores_test.csv')

# Preprocess/Clean data
remove_columns(stores_train, ['store_id','year','store_name','sales_channel_name','address','chain_name','mall_name'])
remove_columns(stores_test, ['store_id','year','store_name','sales_channel_name','address','chain_name','mall_name'])
#remove_retailers_with_0_revenue(stores_train)
quantile_storeType_vs_revenue(stores_train,0.10, 0.80)

# Divide data into x and y train
x_train = stores_train.drop('revenue', axis=1)
y_train = stores_train['revenue']
x_test = stores_test.copy()

# Convert from object type to numerical, using the same category codes for train and test
# (factorizing each set separately would give the same store type different codes)
cat_columns = x_train.select_dtypes(['object']).columns
for col in cat_columns:
    categories = pd.concat([x_train[col], x_test[col]]).astype('category').cat.categories
    x_train[col] = pd.Categorical(x_train[col], categories=categories).codes
    x_test[col] = pd.Categorical(x_test[col], categories=categories).codes



In [None]:
stores_train.head()

Train model

In [None]:
# Model
RFR = RandomForestRegressor(n_estimators=100)

# Fitting
RFR.fit(x_train, y_train)


Test RFR model

In [None]:
# predicting the training data set as a sanity check
pred_train_RFR = RFR.predict(x_train)
print(rmsle(y_train, pred_train_RFR))
print(RFR.score(x_train, y_train))
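
Scoring on the training data mostly measures how well the forest memorizes it. A K-fold estimate is usually more informative. The sketch uses scikit-learn's `cross_val_score` on synthetic data; the feature matrix and target are made up and do not come from the store data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))              # synthetic features
y = 5 + 2 * X[:, 0] + rng.uniform(0, 1, size=200)  # synthetic positive target

model = RandomForestRegressor(n_estimators=50, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn maximizes scores, so the MSLE scorer is negated
neg_msle = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_log_error')
rmsle_per_fold = np.sqrt(-neg_msle)
```

The spread of `rmsle_per_fold` gives a rough idea of how much a single train/test split can mislead.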

Predict test and submit

In [None]:
pred_test_RFR = RFR.predict(x_test)

In [None]:
# Write to file
#writeResultToFile(stores_test, pred_test_RFR, "RFR_10_80_percentile")

# Verify format of submission file
#submissionVery = pd.read_csv('submissionFiles/RFR_10_80_percentile.csv')
#submissionVery.info()

## Emil's models

### model 1


In [None]:
# pythons stuff emil

# 4. Model Interpretations

In [None]:
cat_columns = x_train.select_dtypes(['category']).columns
x_train[cat_columns] = x_train[cat_columns].apply(lambda x: pd.factorize(x)[0])

# Tune
tuner = LGBMTuner(metric = 'rmsle', verbosity=0)
tuner.fit(x_train, y_train)



### Lime

In [None]:
#lime stuff in python

### Feature importance

In [None]:
#feature importance

tuner.plot_importances()
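
As a complement to the importances plotted by the tuner, permutation importance measures how much a validation score drops when a single feature is shuffled. The sketch uses scikit-learn's `permutation_importance` on synthetic data in which only the first feature carries signal; the model and data are stand-ins, not the tuned LightGBM model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                      # column 0: signal, column 1: noise
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)

X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle each column of the validation set several times and record the score drop
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
importances = result.importances_mean
```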


### PDP

In [None]:
#PDP

# 5. Final improved models/predictions

### model 1

In [None]:
#final model 1

### model 2

In [None]:
#final model 2

