## <font color= 'green'>Data Description:</font>

### <font color='#FF4233'>traindata_with_target.csv:</font>
**date:** The timestamp at which the yield of the food processing farm was measured.

**farm_id:** The identifier of the farm and its food processing plant.

**ingredient_type:** The type of ingredient being produced.

**yield:** The yield of the plant in tonnes.

### <font color='#FF4233'>farm_data.csv:</font>
**farm_id:** The identifier of the farm and its food processing plant.

**operations_commencing_year:** The year in which operations commenced on the farm and its food processing plant.

**num_processing_plants:** The number of processing plants present on the farm.

**farm_area:** The area of the farm in square meters.

**farming_company:** The company that owns the farms.

**deidentified_location:** The location at which the farm is present.

### <font color = '#FF4233'>train_weather.csv:</font>
**timestamp:** The time at which weather readings were taken.	

**deidentified_location:** The location at which weather readings were taken.

**temp_obs:** The observed temperature (the units appear to be degrees Celsius).

**cloudiness:** The fraction of the sky that is covered by clouds (units: <font color='#FF33F5'>oktas, i.e. eighths of the sky</font>).

(To understand what each value means, see https://en.wikipedia.org/wiki/Okta)

**wind_direction:** The direction from which the wind originates. Wind direction is usually reported in degrees. Consequently, a wind blowing from the north has a wind direction referred to as 0° (360°); a wind blowing from the east has a wind direction referred to as 90°, etc.	

**dew_temp:** The dew point temperature: the temperature at which the water vapour in the air starts to condense when the air is cooled at constant pressure. (The units appear to be degrees Celsius.)

**pressure_sea_level:** The sea level pressure is the atmospheric pressure at sea level at a given location. It is an indicator of weather. When a low-pressure system moves into an area, it usually leads to cloudiness, wind, and precipitation. High-pressure systems usually lead to fair, calm weather. 	

**precipitation:** The amount of rainfall forecasted (in mm).

<u><font color= 'red'>Note:</font></u> When an area experiences more atmospheric evaporation than rainfall over a period of time, <font color='#FF33F5'>precipitation can be negative</font> for that period.


**wind_speed:** The rate at which air is moving in a particular area. (can be in mph or kmph)

In [1]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Reading the csv files

train = pd.read_csv('C://Users//akhil//Desktop//INSOFE//Final_Hackathon/Datasets/train_data.csv')
farm = pd.read_csv('C://Users//akhil//Desktop//INSOFE//Final_Hackathon/Datasets/farm_data.csv')
train_weather = pd.read_csv('C://Users//akhil//Desktop//INSOFE//Final_Hackathon/Datasets/train_weather.csv')

### Basic understanding of the data

In [3]:
print(f'Train data shape: {train.shape}')
print(f'Farm data shape: {farm.shape}')
print(f'Train weather data shape: {train_weather.shape}')

Train data shape: (20216100, 4)
Farm data shape: (1449, 6)
Train weather data shape: (139773, 9)


In [4]:
train.head()

Unnamed: 0,date,farm_id,ingredient_type,yield
0,2016-01-01 00:00:00,fid_110884,ing_w,0.0
1,2016-01-01 00:00:00,fid_90053,ing_w,0.0
2,2016-01-01 00:00:00,fid_17537,ing_w,0.0
3,2016-01-01 00:00:00,fid_110392,ing_w,0.0
4,2016-01-01 00:00:00,fid_62402,ing_w,0.0


In [5]:
farm.head()

Unnamed: 0,farm_id,operations_commencing_year,num_processing_plants,farm_area,farming_company,deidentified_location
0,fid_110884,2008.0,,690.455096,Obery Farms,location 7369
1,fid_90053,2004.0,,252.69616,Obery Farms,location 7369
2,fid_17537,1991.0,,499.446528,Obery Farms,location 7369
3,fid_110392,2002.0,,2200.407555,Obery Farms,location 7369
4,fid_62402,1975.0,,10833.14012,Obery Farms,location 7369


In [6]:
train_weather.head()

Unnamed: 0,timestamp,deidentified_location,temp_obs,cloudiness,wind_direction,dew_temp,pressure_sea_level,precipitation,wind_speed
0,2016-01-01 00:00:00,location 7369,25.0,6.0,0.0,20.0,1019.7,,0.0
1,2016-01-01 01:00:00,location 7369,24.4,,70.0,21.1,1020.2,-1.0,1.5
2,2016-01-01 02:00:00,location 7369,22.8,2.0,0.0,21.1,1020.2,0.0,0.0
3,2016-01-01 03:00:00,location 7369,21.1,2.0,0.0,20.6,1020.1,0.0,0.0
4,2016-01-01 04:00:00,location 7369,20.0,2.0,250.0,20.0,1020.0,-1.0,2.6


In [7]:
print(f'Train data shape: {train.shape}')
print(f'Farm data shape: {farm.shape}')
print(f'Train weather data shape: {train_weather.shape}')

Train data shape: (20216100, 4)
Farm data shape: (1449, 6)
Train weather data shape: (139773, 9)


### Checking and removing duplicates
* Duplicate rows waste computation and give repeated observations extra weight during training, so they are removed before modelling.

In [8]:
train.rename(columns= {'date': 'timestamp'}, inplace=True)

In [9]:
train[train.duplicated()]

Unnamed: 0,timestamp,farm_id,ingredient_type,yield
1361753,2016-01-25 16:00:00,fid_63700,ing_w,0.0000
1361754,2016-01-25 16:00:00,fid_63700,ing_x,0.0000
1543730,2016-01-29 00:00:00,fid_68761,ing_w,21.0600
7200211,2016-05-14 01:00:00,fid_71910,ing_w,27.8000
8655017,2016-06-08 22:00:00,fid_68761,ing_x,28.3458
...,...,...,...,...
15973752,2016-10-17 06:00:00,fid_63700,ing_w,0.0000
15973753,2016-10-17 06:00:00,fid_63700,ing_x,0.0000
17426580,2016-11-12 02:00:00,fid_71910,ing_w,30.3000
17506726,2016-11-13 13:00:00,fid_71910,ing_w,29.1000


In [10]:
train.drop_duplicates(keep = 'first', inplace=True)

In [11]:
train[train.duplicated()]

Unnamed: 0,timestamp,farm_id,ingredient_type,yield


In [12]:
farm[farm.duplicated()]

Unnamed: 0,farm_id,operations_commencing_year,num_processing_plants,farm_area,farming_company,deidentified_location


In [13]:
train_weather[train_weather.duplicated()]

Unnamed: 0,timestamp,deidentified_location,temp_obs,cloudiness,wind_direction,dew_temp,pressure_sea_level,precipitation,wind_speed
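
`DataFrame.duplicated()` with no arguments flags only rows that match an earlier row in every column. As a minimal sketch on made-up data (the frame `toy` and its values are illustrative, not taken from the dataset), the snippet below contrasts exact duplicates with rows that merely share the same key columns but disagree on `yield`, which `drop_duplicates()` would leave in place.

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    "timestamp": ["2016-01-25 16:00:00"] * 3 + ["2016-01-29 00:00:00"],
    "farm_id": ["fid_a", "fid_a", "fid_a", "fid_b"],
    "ingredient_type": ["ing_w", "ing_w", "ing_w", "ing_w"],
    "yield": [0.0, 0.0, 5.0, 21.06],
})

key_cols = ["timestamp", "farm_id", "ingredient_type"]

# Rows identical to an earlier row in every column.
n_exact = int(toy.duplicated().sum())

# Rows that repeat an earlier key, even if the yield differs.
n_key = int(toy.duplicated(subset=key_cols).sum())

# Same call as in the notebook: only exact duplicates are removed.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_exact, n_key, len(deduped))
```

If the key-level count is larger than the exact count, the remaining rows carry conflicting targets for the same farm, ingredient and hour, and may deserve a separate look.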


## Merging the datasets
* The three datasets are merged so that every yield observation carries the details of its farm and the weather at that farm's location: `train` and `farm` are joined on farm_id, and the result is joined with `train_weather` on timestamp and deidentified_location.

In [14]:
print(f'Train data shape: {train.shape}')
print(f'Farm data shape: {farm.shape}')
print(f'Train weather data shape: {train_weather.shape}')

Train data shape: (20215983, 4)
Farm data shape: (1449, 6)
Train weather data shape: (139773, 9)


In [15]:
common_cols = []
for x in train.columns:
    if x in farm.columns:
        common_cols.append(x)
        
print(common_cols)

['farm_id']


In [16]:
new_train_df = pd.merge(train,farm, on= common_cols, how = 'inner')

In [17]:
new_train_df.shape

(20602665, 9)
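
The row count grows from 20,215,983 to 20,602,665 in this merge. With an inner join on a single key, the result can only have more rows than the left table when some key values are repeated in the right table, so some `farm_id` values probably appear more than once in `farm` with differing attributes. The sketch below, on hypothetical toy frames (`train_toy` and `farm_toy` are not part of the notebook), shows how to list the repeated keys and how the `validate` argument of `merge` turns this situation into an explicit error.

```python
import pandas as pd

# Hypothetical toy frames; fid_a is listed twice in the farm table.
train_toy = pd.DataFrame({"farm_id": ["fid_a", "fid_a", "fid_b"],
                          "yield": [1.0, 2.0, 3.0]})
farm_toy = pd.DataFrame({"farm_id": ["fid_a", "fid_a", "fid_b"],
                         "farm_area": [10.0, 12.0, 30.0]})

# Keys that occur more than once on the right-hand side.
repeated = farm_toy["farm_id"].duplicated(keep=False)
dup_keys = farm_toy.loc[repeated, "farm_id"].unique()

# Each fid_a row in train_toy matches two farm rows, so rows multiply.
merged = train_toy.merge(farm_toy, on="farm_id", how="inner")
print(len(train_toy), len(merged))

# validate="many_to_one" raises if the right-hand keys are not unique.
try:
    train_toy.merge(farm_toy, on="farm_id", how="inner", validate="many_to_one")
    validated = True
except pd.errors.MergeError:
    validated = False
print(validated)
```

How to resolve repeated keys (keeping one record per farm, or aggregating) depends on what the repeated farm records mean, which the data description does not say.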

In [18]:
new_train_df.head()

Unnamed: 0,timestamp,farm_id,ingredient_type,yield,operations_commencing_year,num_processing_plants,farm_area,farming_company,deidentified_location
0,2016-01-01 00:00:00,fid_110884,ing_w,0.0,2008.0,,690.455096,Obery Farms,location 7369
1,2016-01-01 01:00:00,fid_110884,ing_w,0.0,2008.0,,690.455096,Obery Farms,location 7369
2,2016-01-01 02:00:00,fid_110884,ing_w,0.0,2008.0,,690.455096,Obery Farms,location 7369
3,2016-01-01 03:00:00,fid_110884,ing_w,0.0,2008.0,,690.455096,Obery Farms,location 7369
4,2016-01-01 04:00:00,fid_110884,ing_w,0.0,2008.0,,690.455096,Obery Farms,location 7369


In [19]:
new_train_df[new_train_df.duplicated()]

Unnamed: 0,timestamp,farm_id,ingredient_type,yield,operations_commencing_year,num_processing_plants,farm_area,farming_company,deidentified_location


In [20]:
common_cols = []
for x in new_train_df.columns:
    if x in train_weather.columns:
        common_cols.append(x)
        
print(common_cols)

['timestamp', 'deidentified_location']


In [21]:
train_df = pd.merge(new_train_df,train_weather, on=common_cols, how = 'inner')

In [22]:
train_df.shape

(20511298, 16)
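
This inner join drops rows: the frame shrinks from 20,602,665 to 20,511,298 rows, so some (timestamp, location) pairs have no weather reading. One way to inspect what is lost, sketched here on invented toy data, is a left join with `indicator=True`, which adds a `_merge` column marking rows with no match.

```python
import pandas as pd

# Invented toy frames; the 01:00 reading for "location 1" is missing.
left = pd.DataFrame({
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00",
                  "2016-01-01 00:00:00"],
    "deidentified_location": ["location 1", "location 1", "location 2"],
    "yield": [0.0, 1.5, 2.0],
})
weather = pd.DataFrame({
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 00:00:00"],
    "deidentified_location": ["location 1", "location 2"],
    "temp_obs": [25.0, 10.0],
})

keys = ["timestamp", "deidentified_location"]

# A left join keeps every yield row; `_merge` records whether weather was found.
checked = left.merge(weather, on=keys, how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(checked["_merge"].value_counts())
print(unmatched[keys])
```

Grouping the unmatched rows by location or by date would show whether the gaps are spread evenly or concentrated at particular weather stations.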

In [23]:
train_df.isna().mean()*100

timestamp                      0.000000
farm_id                        0.000000
ingredient_type                0.000000
yield                          0.000000
operations_commencing_year    60.142732
num_processing_plants         82.467380
farm_area                      0.000000
farming_company                0.000000
deidentified_location          0.000000
temp_obs                       0.030432
cloudiness                    43.414230
wind_direction                 6.709649
dew_temp                       0.047598
pressure_sea_level             5.786630
precipitation                 18.185568
wind_speed                     0.261651
dtype: float64

In [24]:
train_df['timestamp'] = pd.to_datetime(train_df['timestamp'])
train_df = train_df.sort_values(['timestamp'])

In [25]:
train_df.drop(['operations_commencing_year', 'num_processing_plants'], axis = 1, inplace = True)
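
The two dropped columns are the ones with roughly 60% and 82% missing values. Rather than naming them by hand, the selection can be derived from the missing-value percentages. This is only a sketch: the 50% cutoff (`MAX_NULL_PCT`) is an assumption made for illustration, not a threshold stated in the notebook, and the toy values are invented.

```python
import numpy as np
import pandas as pd

# Invented toy frame with two sparsely populated columns.
toy = pd.DataFrame({
    "operations_commencing_year": [2008.0, np.nan, np.nan, np.nan, 1975.0],
    "num_processing_plants": [np.nan, np.nan, np.nan, np.nan, 2.0],
    "farm_area": [690.5, 252.7, 499.4, 2200.4, 10833.1],
})

MAX_NULL_PCT = 50.0  # hypothetical cutoff, chosen for this example only

# Percentage of missing values per column, as in `isna().mean()*100`.
null_pct = toy.isna().mean() * 100

# Columns whose missing share exceeds the cutoff.
to_drop = null_pct[null_pct > MAX_NULL_PCT].index.tolist()
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```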

In [26]:
train_df.isna().mean()*100

timestamp                 0.000000
farm_id                   0.000000
ingredient_type           0.000000
yield                     0.000000
farm_area                 0.000000
farming_company           0.000000
deidentified_location     0.000000
temp_obs                  0.030432
cloudiness               43.414230
wind_direction            6.709649
dew_temp                  0.047598
pressure_sea_level        5.786630
precipitation            18.185568
wind_speed                0.261651
dtype: float64

In [27]:
train_df.columns

Index(['timestamp', 'farm_id', 'ingredient_type', 'yield', 'farm_area',
       'farming_company', 'deidentified_location', 'temp_obs', 'cloudiness',
       'wind_direction', 'dew_temp', 'pressure_sea_level', 'precipitation',
       'wind_speed'],
      dtype='object')

In [28]:
train_df.dtypes

timestamp                datetime64[ns]
farm_id                          object
ingredient_type                  object
yield                           float64
farm_area                       float64
farming_company                  object
deidentified_location            object
temp_obs                        float64
cloudiness                      float64
wind_direction                  float64
dew_temp                        float64
pressure_sea_level              float64
precipitation                   float64
wind_speed                      float64
dtype: object

In [29]:
train_df['cloudiness'].value_counts().idxmin()

5.0

In [30]:
train_df[['farm_id', 'ingredient_type', 'farming_company', 'deidentified_location']] = train_df[['farm_id', 'ingredient_type', 'farming_company', 'deidentified_location']].astype('category')

In [31]:
float_cols = train_df.select_dtypes('float64').columns
train_df[float_cols] = train_df[float_cols].astype('float32')

In [32]:
train_df.dtypes

timestamp                datetime64[ns]
farm_id                        category
ingredient_type                category
yield                           float32
farm_area                       float32
farming_company                category
deidentified_location          category
temp_obs                        float32
cloudiness                      float32
wind_direction                  float32
dew_temp                        float32
pressure_sea_level              float32
precipitation                   float32
wind_speed                      float32
dtype: object
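
To see what the `category` and `float32` conversions buy, memory use can be compared before and after with `memory_usage(deep=True)`. The sketch below uses a small synthetic frame, so the absolute numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: a low-cardinality string column and a float column.
n = 10_000
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "farm_id": rng.choice(["fid_a", "fid_b", "fid_c"], size=n),
    "temp_obs": np.linspace(-5.0, 35.0, n),
})

# deep=True also counts the memory held by the Python strings.
before = int(toy.memory_usage(deep=True).sum())

compact = toy.astype({"farm_id": "category", "temp_obs": "float32"})
after = int(compact.memory_usage(deep=True).sum())

print(f"{before:,} bytes -> {after:,} bytes")
```

The trade-off is precision: `float32` keeps about seven significant digits, which is why a value such as 1019.7 is displayed as 1019.700012 after the conversion.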

In [33]:
train_df.head()

Unnamed: 0,timestamp,farm_id,ingredient_type,yield,farm_area,farming_company,deidentified_location,temp_obs,cloudiness,wind_direction,dew_temp,pressure_sea_level,precipitation,wind_speed
0,2016-01-01,fid_110884,ing_w,0.0,690.455078,Obery Farms,location 7369,25.0,6.0,0.0,20.0,1019.700012,,0.0
7828412,2016-01-01,fid_92885,ing_w,95.639999,9188.106445,Obery Farms,location 5290,10.0,8.0,350.0,2.2,1021.099976,,4.1
7828411,2016-01-01,fid_97108,ing_w,214.639999,7900.378418,Obery Farms,location 5290,10.0,8.0,350.0,2.2,1021.099976,,4.1
7828410,2016-01-01,fid_52171,ing_w,57.77,6466.048828,Obery Farms,location 5290,10.0,8.0,350.0,2.2,1021.099976,,4.1
7828409,2016-01-01,fid_120541,ing_w,27.84,6187.339844,Obery Farms,location 5290,10.0,8.0,350.0,2.2,1021.099976,,4.1


In [34]:
float_nulls = list(train_df.columns[7:])
float_nulls

['temp_obs',
 'cloudiness',
 'wind_direction',
 'dew_temp',
 'pressure_sea_level',
 'precipitation',
 'wind_speed']

### Imputing the null values based on the location.
* Weather changes from one location to the other, either marginally or largely
* Imputing them in this way can help us to represent the data in a better fashion

In [35]:
%%time

imputed_values = pd.DataFrame(columns= ['deidentified_location', 'imputed_column', 'imputed_value'])

# A total of 16*7 = 112 iterations will run in this loop
count = 1
for x in train_df['deidentified_location'].unique():
    loc_mask = train_df['deidentified_location'] == x
    for z in float_nulls:
        # Mean of this feature at this location (NaN if every value there is null)
        re = train_df.loc[loc_mask, z].mean()

        train_df.loc[loc_mask, z] = train_df.loc[loc_mask, z].fillna(re)

        # DataFrame.append was removed in pandas 2.0, so the record is added by label instead
        imputed_values.loc[len(imputed_values)] = [x, z, re]
        print(f'iter: {count}')
        count += 1

iter: 1
iter: 2
iter: 3
iter: 4
iter: 5
iter: 6
iter: 7
iter: 8
iter: 9
iter: 10
iter: 11
iter: 12
iter: 13
iter: 14
iter: 15
iter: 16
iter: 17
iter: 18
iter: 19
iter: 20
iter: 21
iter: 22
iter: 23
iter: 24
iter: 25
iter: 26
iter: 27
iter: 28
iter: 29
iter: 30
iter: 31
iter: 32
iter: 33
iter: 34
iter: 35
iter: 36
iter: 37
iter: 38
iter: 39
iter: 40
iter: 41
iter: 42
iter: 43
iter: 44
iter: 45
iter: 46
iter: 47
iter: 48
iter: 49
iter: 50
iter: 51
iter: 52
iter: 53
iter: 54
iter: 55
iter: 56
iter: 57
iter: 58
iter: 59
iter: 60
iter: 61
iter: 62
iter: 63
iter: 64
iter: 65
iter: 66
iter: 67
iter: 68
iter: 69
iter: 70
iter: 71
iter: 72
iter: 73
iter: 74
iter: 75
iter: 76
iter: 77
iter: 78
iter: 79
iter: 80
iter: 81
iter: 82
iter: 83
iter: 84
iter: 85
iter: 86
iter: 87
iter: 88
iter: 89
iter: 90
iter: 91
iter: 92
iter: 93
iter: 94
iter: 95
iter: 96
iter: 97
iter: 98
iter: 99
iter: 100
iter: 101
iter: 102
iter: 103
iter: 104
iter: 105
iter: 106
iter: 107
iter: 108
iter: 109
iter: 110
iter: 11

In [36]:
imputed_values

Unnamed: 0,deidentified_location,imputed_column,imputed_value
0,location 7369,temp_obs,23.054689
1,location 7369,cloudiness,3.048423
2,location 7369,wind_direction,155.227356
3,location 7369,dew_temp,17.055305
4,location 7369,pressure_sea_level,1022.635681
...,...,...,...
107,location 5150,wind_direction,182.995102
108,location 5150,dew_temp,5.496694
109,location 5150,pressure_sea_level,1021.985046
110,location 5150,precipitation,3.762756


In [37]:
train_df[(train_df['deidentified_location'] == 'location 868')]['cloudiness']

20394039   NaN
20394040   NaN
20394042   NaN
20394049   NaN
20394048   NaN
            ..
20511192   NaN
20511191   NaN
20511190   NaN
20511189   NaN
20511188   NaN
Name: cloudiness, Length: 117259, dtype: float32

In [38]:
print(train_df[train_df['cloudiness'].isna()]['deidentified_location'].value_counts())
print('\n')
print(train_df[train_df['pressure_sea_level'].isna()]['deidentified_location'].value_counts())
print('\n')
print(train_df[train_df['precipitation'].isna()]['deidentified_location'].value_counts())
print('\n')

location 4525    359642
location 868     117259
location 959          0
location 8421         0
location 7369         0
location 7048         0
location 6364         0
location 5833         0
location 5677         0
location 565          0
location 5489         0
location 5410         0
location 5290         0
location 5150         0
location 2532         0
location 1784         0
Name: deidentified_location, dtype: int64


location 6364    819731
location 959          0
location 868          0
location 8421         0
location 7369         0
location 7048         0
location 5833         0
location 5677         0
location 565          0
location 5489         0
location 5410         0
location 5290         0
location 5150         0
location 4525         0
location 2532         0
location 1784         0
Name: deidentified_location, dtype: int64


location 6364    819731
location 959     552034
location 7048    323623
location 868          0
location 8421         0
location 7369         0


In [39]:
null_locs = ['location 6364', 'location 4525', 'location 868', 'location 959', 'location 7048']

for x in null_locs:
    print(f"{round(len(train_df[train_df['deidentified_location'] == x])/len(train_df) *100,3)}% of the total observations are in location '{x}' \n")
    print(f"Percentage of null values of all the observations at location '{x}'")    
    print(train_df[train_df['deidentified_location'] == x][['cloudiness', 'pressure_sea_level', 'precipitation']].isna().mean()*100)
    print('*********************************\n')
    

3.996% of the total observations are in location 'location 6364' 

Percentage of null values of all the observations at location 'location 6364'
cloudiness              0.0
pressure_sea_level    100.0
precipitation         100.0
dtype: float64
*********************************

1.753% of the total observations are in location 'location 4525' 

Percentage of null values of all the observations at location 'location 4525'
cloudiness            100.0
pressure_sea_level      0.0
precipitation           0.0
dtype: float64
*********************************

0.572% of the total observations are in location 'location 868' 

Percentage of null values of all the observations at location 'location 868'
cloudiness            100.0
pressure_sea_level      0.0
precipitation           0.0
dtype: float64
*********************************

2.691% of the total observations are in location 'location 959' 

Percentage of null values of all the observations at location 'location 959'
cloudiness            

**At a few locations, some features are null for every observation, so a location-level mean does not exist. For those features, the nulls are imputed with the mean of the feature over all observations.**

In [40]:
%%time

# Only ['cloudiness', 'pressure_sea_level', 'precipitation'] are used because no other feature still has null values
count = 1
for x in null_locs:
    loc_mask = train_df['deidentified_location'] == x
    for y in ['cloudiness', 'pressure_sea_level', 'precipitation']:

        if train_df.loc[loc_mask, y].isna().any():

            # Fall back to the mean of the feature over all locations
            re = train_df[y].mean()

            train_df.loc[loc_mask, y] = train_df.loc[loc_mask, y].fillna(re)

            # DataFrame.append was removed in pandas 2.0, so the record is added by label instead
            imputed_values.loc[len(imputed_values)] = [x, y, re]

            print(f'iter: {count}')
            count += 1

iter: 1
iter: 2
iter: 3
iter: 4
iter: 5
iter: 6
Wall time: 4.35 s
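
The two imputation passes (mean per location, then an overall mean where a location has no values at all) can also be written without explicit loops, using `groupby(...).transform("mean")`. This is a sketch on invented toy data rather than a drop-in replacement: here the overall mean is computed from the originally observed values only, whereas the loops above compute it after the per-location fill, so the fallback values can differ slightly.

```python
import numpy as np
import pandas as pd

# Invented toy data: "loc B" has no cloudiness readings at all.
toy = pd.DataFrame({
    "deidentified_location": ["loc A", "loc A", "loc A", "loc B", "loc B"],
    "cloudiness": [2.0, np.nan, 4.0, np.nan, np.nan],
    "temp_obs": [20.0, 22.0, np.nan, 10.0, np.nan],
})
weather_cols = ["cloudiness", "temp_obs"]

# Overall mean of each feature, used only where a location has no data.
global_means = toy[weather_cols].mean()

# Per-location mean broadcast back to every row (NaN for all-null groups).
group_means = (
    toy.groupby("deidentified_location", observed=True)[weather_cols]
       .transform("mean")
)

imputed = toy.copy()
imputed[weather_cols] = (
    toy[weather_cols].fillna(group_means).fillna(global_means)
)
print(imputed)
```

Because `transform` returns a frame aligned with the original index, no boolean mask per location is needed, and the work is done in a handful of vectorized passes instead of one pass per location and column.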


In [41]:
train_df.isna().mean()*100

timestamp                0.0
farm_id                  0.0
ingredient_type          0.0
yield                    0.0
farm_area                0.0
farming_company          0.0
deidentified_location    0.0
temp_obs                 0.0
cloudiness               0.0
wind_direction           0.0
dew_temp                 0.0
pressure_sea_level       0.0
precipitation            0.0
wind_speed               0.0
dtype: float64

In [42]:
train_df.dtypes

timestamp                datetime64[ns]
farm_id                        category
ingredient_type                category
yield                           float32
farm_area                       float32
farming_company                category
deidentified_location          category
temp_obs                        float32
cloudiness                      float32
wind_direction                  float32
dew_temp                        float32
pressure_sea_level              float32
precipitation                   float32
wind_speed                      float32
dtype: object

**For locations where a feature was entirely null, the location-level mean was itself null, and those null "imputed values" were recorded in the `imputed_values` dataframe. They are removed here.**

In [43]:
null_rows_bool = pd.isnull(imputed_values['imputed_value'])
imputed_values[null_rows_bool]

Unnamed: 0,deidentified_location,imputed_column,imputed_value
29,location 868,cloudiness,
36,location 4525,cloudiness,
47,location 959,precipitation,
61,location 7048,precipitation,
81,location 6364,pressure_sea_level,
82,location 6364,precipitation,


In [44]:
null_rows = imputed_values[null_rows_bool].index
null_rows

Int64Index([29, 36, 47, 61, 81, 82], dtype='int64')

In [45]:
imputed_values.drop(null_rows, axis = 0, inplace = True)
imputed_values.reset_index(drop = True, inplace = True)
imputed_values

Unnamed: 0,deidentified_location,imputed_column,imputed_value
0,location 7369,temp_obs,23.054689
1,location 7369,cloudiness,3.048423
2,location 7369,wind_direction,155.227356
3,location 7369,dew_temp,17.055305
4,location 7369,pressure_sea_level,1022.635681
...,...,...,...
107,location 6364,precipitation,1.301377
108,location 4525,cloudiness,1.806017
109,location 868,cloudiness,1.799751
110,location 959,precipitation,1.310004


In [46]:
train_df.tail()

Unnamed: 0,timestamp,farm_id,ingredient_type,yield,farm_area,farming_company,deidentified_location,temp_obs,cloudiness,wind_direction,dew_temp,pressure_sea_level,precipitation,wind_speed
10261720,2016-12-31 23:00:00,fid_70492,ing_w,30.27,1047.388428,Sanderson Farms,location 5290,8.9,3.878274,200.0,-5.6,1015.5,0.0,6.2
10261719,2016-12-31 23:00:00,fid_35085,ing_w,85.779999,25204.583984,Obery Farms,location 5290,8.9,3.878274,200.0,-5.6,1015.5,0.0,6.2
10261718,2016-12-31 23:00:00,fid_109740,ing_w,113.5,8000.806152,Sanderson Farms,location 5290,8.9,3.878274,200.0,-5.6,1015.5,0.0,6.2
10261727,2016-12-31 23:00:00,fid_99209,ing_w,13.85,129.692581,Sanderson Farms,location 5290,8.9,3.878274,200.0,-5.6,1015.5,0.0,6.2
17455590,2016-12-31 23:00:00,fid_39141,ing_x,67.453201,3546.38623,Wayne Farms,location 8421,24.4,0.0,116.513313,16.1,1007.900024,0.0,1.5


**Rounding the imputed values so that they are compatible with the units of those features.**
* Cloudiness is reported in oktas, so it should always be an integer between 0 and 9 (both inclusive).
* In the given data, wind_direction takes integer values, so it is rounded to the nearest integer.

In [48]:
for col in ['cloudiness', 'wind_direction']:
    train_df[col] = train_df[col].apply(lambda x: round(x,0))

In [49]:
for x in ['temp_obs', 'dew_temp', 'pressure_sea_level', 'wind_speed', 'precipitation']:
    train_df[x] = train_df[x].apply(lambda x: round(x,1))
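
Calling `apply` with a Python `lambda` invokes `round` once per row, which is slow on roughly 20 million rows. A vectorized alternative is `DataFrame.round` with a dictionary of decimals per column, sketched below on a few invented values. Unlike the `apply` version, which leaves these columns as `float64` in the `dtypes` output further down, this keeps them as `float32`.

```python
import numpy as np
import pandas as pd

# A few invented float32 values in the style of the imputed columns.
toy = pd.DataFrame({
    "cloudiness": np.array([3.878274, 0.0, 1.806017], dtype="float32"),
    "wind_direction": np.array([200.0, 116.513313, 182.995102], dtype="float32"),
    "temp_obs": np.array([8.9, 24.44, 23.054689], dtype="float32"),
})

# Number of decimals per column: integers for cloudiness and wind
# direction, one decimal place for temperature.
decimals = {"cloudiness": 0, "wind_direction": 0, "temp_obs": 1}
rounded = toy.round(decimals)

print(rounded)
print(rounded.dtypes)
```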

In [50]:
train_df.tail()

Unnamed: 0,timestamp,farm_id,ingredient_type,yield,farm_area,farming_company,deidentified_location,temp_obs,cloudiness,wind_direction,dew_temp,pressure_sea_level,precipitation,wind_speed
10261720,2016-12-31 23:00:00,fid_70492,ing_w,30.27,1047.388428,Sanderson Farms,location 5290,8.9,4.0,200.0,-5.6,1015.5,0.0,6.2
10261719,2016-12-31 23:00:00,fid_35085,ing_w,85.779999,25204.583984,Obery Farms,location 5290,8.9,4.0,200.0,-5.6,1015.5,0.0,6.2
10261718,2016-12-31 23:00:00,fid_109740,ing_w,113.5,8000.806152,Sanderson Farms,location 5290,8.9,4.0,200.0,-5.6,1015.5,0.0,6.2
10261727,2016-12-31 23:00:00,fid_99209,ing_w,13.85,129.692581,Sanderson Farms,location 5290,8.9,4.0,200.0,-5.6,1015.5,0.0,6.2
17455590,2016-12-31 23:00:00,fid_39141,ing_x,67.453201,3546.38623,Wayne Farms,location 8421,24.4,0.0,117.0,16.1,1007.9,0.0,1.5


**The timestamp can't be used directly to build the model, so it is split into year, month, date and hour to preserve the characteristics of each particular moment.**

In [51]:
%%time


train_df['year'] = train_df['timestamp'].dt.year

train_df['month']=train_df['timestamp'].dt.month

train_df['date'] = train_df['timestamp'].dt.day

train_df['hour'] = train_df['timestamp'].dt.hour

Wall time: 6.46 s


In [52]:
print(train_df.year.unique())
print('\n')
print(train_df.month.unique())
print('\n')
print(train_df.date.unique())
print('\n')
print(train_df.hour.unique())

[2016]


[ 1  2  3  4  5  6  7  8  9 10 11 12]


[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31]


[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]


In [53]:
# The year column doesn't have any variation, so removing it.
train_df.drop('year', axis = 1, inplace = True)
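
Spotting the constant `year` column by reading the `unique()` output works for four columns. For a wider frame, the same check can be automated with `nunique`. A small sketch on invented timestamps:

```python
import pandas as pd

# Three invented timestamps, all within 2016.
ts = pd.to_datetime(pd.Series([
    "2016-01-01 00:00:00",
    "2016-06-15 13:00:00",
    "2016-12-31 23:00:00",
]))

parts = pd.DataFrame({
    "year": ts.dt.year,
    "month": ts.dt.month,
    "date": ts.dt.day,   # day of month, named `date` as in the notebook
    "hour": ts.dt.hour,
})

# Columns with a single distinct value carry no information for a model.
constant_cols = [c for c in parts.columns if parts[c].nunique(dropna=False) <= 1]
parts = parts.drop(columns=constant_cols)

print(constant_cols)
print(list(parts.columns))
```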

In [54]:
train_df.isna().sum()

timestamp                0
farm_id                  0
ingredient_type          0
yield                    0
farm_area                0
farming_company          0
deidentified_location    0
temp_obs                 0
cloudiness               0
wind_direction           0
dew_temp                 0
pressure_sea_level       0
precipitation            0
wind_speed               0
month                    0
date                     0
hour                     0
dtype: int64

In [55]:
train_df.dtypes

timestamp                datetime64[ns]
farm_id                        category
ingredient_type                category
yield                           float32
farm_area                       float32
farming_company                category
deidentified_location          category
temp_obs                        float64
cloudiness                      float64
wind_direction                  float64
dew_temp                        float64
pressure_sea_level              float64
precipitation                   float64
wind_speed                      float64
month                             int64
date                              int64
hour                              int64
dtype: object

**Converting the 64-bit columns to 32-bit to reduce memory usage and file size.**

In [56]:
int_cols = train_df.select_dtypes('int64').columns
train_df[int_cols] = train_df[int_cols].astype('int32')

In [57]:
float_cols = train_df.select_dtypes('float64').columns
train_df[float_cols] = train_df[float_cols].astype('float32')

In [58]:
train_df.dtypes

timestamp                datetime64[ns]
farm_id                        category
ingredient_type                category
yield                           float32
farm_area                       float32
farming_company                category
deidentified_location          category
temp_obs                        float32
cloudiness                      float32
wind_direction                  float32
dew_temp                        float32
pressure_sea_level              float32
precipitation                   float32
wind_speed                      float32
month                             int32
date                              int32
hour                              int32
dtype: object

In [59]:
train_df.drop('timestamp', axis = 1,inplace = True)

In [60]:
train_df.dtypes

farm_id                  category
ingredient_type          category
yield                     float32
farm_area                 float32
farming_company          category
deidentified_location    category
temp_obs                  float32
cloudiness                float32
wind_direction            float32
dew_temp                  float32
pressure_sea_level        float32
precipitation             float32
wind_speed                float32
month                       int32
date                        int32
hour                        int32
dtype: object

### Saving the dataframes after preprocessing, so that the entire notebook does not need to be rerun each time.

In [67]:
%%time
#train_df.to_csv('final_train_data_32bit.csv', index = False)

Wall time: 4min 2s


In [68]:
#imputed_values.to_csv('train_imputed_values.csv', index = False)
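
A CSV file stores no type information, so the `category`, `float32` and `int32` dtypes set above are lost when the saved file is read back with default settings. The sketch below round-trips a tiny invented frame through an in-memory buffer (standing in for the file on disk) and restores the dtypes by passing a `dtype` mapping to `read_csv`.

```python
import io

import numpy as np
import pandas as pd

# Tiny invented frame with the same kinds of dtypes as `train_df`.
toy = pd.DataFrame({
    "farm_id": pd.Categorical(["fid_a", "fid_b", "fid_a"]),
    "temp_obs": np.array([8.9, 24.4, 10.0], dtype="float32"),
    "hour": np.array([0, 1, 2], dtype="int32"),
})

# In-memory buffer standing in for the CSV file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: numeric columns come back as 64-bit types.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Read again, stating the intended dtypes explicitly.
buffer.seek(0)
dtype_map = {"farm_id": "category", "temp_obs": "float32", "hour": "int32"}
restored = pd.read_csv(buffer, dtype=dtype_map)

print(naive.dtypes)
print(restored.dtypes)
```

Keeping the dtype mapping next to the code that writes the file means the compact representation can be recovered wherever the preprocessed data is loaded.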