## Importing Libraries

The code starts by importing several Python libraries commonly used in data science and machine learning tasks. These include:

- `pandas` for data manipulation and analysis
- `matplotlib` for creating visualizations
- `seaborn` for creating more aesthetically pleasing statistical graphics
- `numpy` for numerical computing
- `os` for building portable file paths

In [40]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

%matplotlib inline

## Defining a Helper Function for Reading Data

The code defines a helper function called `print_lines` that takes a file path and an optional integer `n` (default 10). It opens the file in binary mode, reads all of its lines, prints the total line count, and then prints the first `n` lines as raw bytes. This is why the output below shows `b'...'` prefixes and `\r\n` line endings.

The code then defines three variables holding the paths to CSV files: monthly rainfall and temperature data for Kenya (1991 to 2016) and a tourist expenditure dataset. After previewing each file, it uses the `pd.read_csv()` function from the `pandas` library to read each one into a DataFrame. The resulting DataFrames are stored in the `rainfall_df`, `temp_df`, and `exp_df` variables.

In [41]:
def print_lines(file, n=10):
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
        print(len(lines))
    for line in lines[:n]:
        print(line)

rainfall_data = os.path.join("data", "kenya-climate-data-1991-2016-rainfallmm.csv")
temp_data = os.path.join("data", "kenya-climate-data-1991-2016-temp-degrees-celcius.csv")
exp_data =  os.path.join("data", "tourist_expenditure.csv")

print_lines(rainfall_data)
print_lines(temp_data)
print_lines(exp_data)

rainfall_df = pd.read_csv(rainfall_data)
temp_df = pd.read_csv(temp_data)
exp_df = pd.read_csv(exp_data)

313
b'Year,Month Average,Rainfall - (MM)\r\n'
b'1991,Jan Average,38.2847\r\n'
b'1991,Feb Average,12.7492\r\n'
b'1991,Mar Average,73.3656\r\n'
b'1991,Apr Average,83.135\r\n'
b'1991,May Average,112.275\r\n'
b'1991,Jun Average,33.6106\r\n'
b'1991,Jul Average,36.6575\r\n'
b'1991,Aug Average,32.8066\r\n'
b'1991,Sep Average,18.3184\r\n'
313
b'Year,Month Average,Temperature - (Celsius)\r\n'
b'1991,Jan Average,25.1631\r\n'
b'1991,Feb Average,26.0839\r\n'
b'1991,Mar Average,26.2236\r\n'
b'1991,Apr Average,25.5812\r\n'
b'1991,May Average,24.6618\r\n'
b'1991,Jun Average,23.9439\r\n'
b'1991,Jul Average,22.9982\r\n'
b'1991,Aug Average,23.0391\r\n'
b'1991,Sep Average,23.9423\r\n'
297
b'Age,Gender,Income (USD),Duration of Stay,Num. of People,Accommodation,Travel Expenditure (USD),Destination\r\n'
b'26,Male,50000,10,2,Hotel,3000,Nairobi\r\n'
b'32,Female,70000,7,1,Hostel,2500,Mombasa\r\n'
b'45,Male,80000,14,3,Motel,5000,Narok\r\n'
b'38,Female,90000,5,1,Hotel,2000,Kwale\r\n'
b'54,Male,60000,21,2,other,8
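
`print_lines` loads every line into memory before slicing off the first `n`. For larger files, a variant that stops reading after `n` lines may be preferable. The sketch below is one way to do that; `peek_lines` and the demo file are hypothetical names introduced for illustration, and unlike `print_lines` this version does not report the total line count.

```python
import os
import tempfile
from itertools import islice


def peek_lines(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    # newline="" keeps the original line endings so they can be stripped explicitly.
    with open(path, "r", encoding="utf-8", newline="") as datafile:
        return [line.rstrip("\r\n") for line in islice(datafile, n)]


# Demo on a throwaway file with Windows-style line endings, like the CSVs above.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", encoding="utf-8", newline="") as f:
    f.write(
        "Year,Month Average,Rainfall - (MM)\r\n"
        "1991,Jan Average,38.2847\r\n"
        "1991,Feb Average,12.7492\r\n"
    )

preview = peek_lines(demo_path, n=2)
print(preview)
```

Opening the file in text mode also avoids the `b'...'` byte-string prefixes seen in the output above.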

## Balance Rows and Concatenate DataFrames


In [42]:
dates = pd.date_range(start='1992-05-01', end='2016-12-31', freq='M')
print(len(dates))

296
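
The row counts explain the choice of start date and the 16-row drop in the next cell. Each climate file has 313 lines, which is one header plus 312 monthly rows (January 1991 to December 2016), while the expenditure file has 297 lines, or 296 data rows. Dropping the first 16 climate rows (January 1991 to April 1992) leaves 296 months starting in May 1992. The sketch below checks that arithmetic with a monthly period range; the variable names are illustrative only.

```python
import pandas as pd

# 26 years of monthly observations, matching the 312 data rows in each climate file.
full = pd.period_range(start="1991-01", end="2016-12", freq="M")

# Skipping the first 16 months mirrors rainfall_df.drop(rainfall_df.index[:16]).
trimmed = full[16:]

print(len(full), len(trimmed), trimmed[0])
```

In pandas 2.2 and later, `freq='M'` in `pd.date_range` is deprecated in favor of `'ME'` (month end); the notebook output here was presumably produced with an earlier version.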


In [43]:
rainfall_df = rainfall_df.drop(rainfall_df.index[:16])
temp_df = temp_df.drop(temp_df.index[:16])
rainfall_df["Rainfall - (MM)"] = rainfall_df["Rainfall - (MM)"].astype("float32")
temp_only_df = temp_df["Temperature - (Celsius)"].astype("float32")
final_df = rainfall_df.join(temp_only_df)
final_df['dates'] = dates
final_df

Unnamed: 0,Year,Month Average,Rainfall - (MM),Temperature - (Celsius),dates
16,1992,May Average,80.600098,24.785200,1992-05-31
17,1992,Jun Average,30.084200,24.056299,1992-06-30
18,1992,Jul Average,35.107899,22.837700,1992-07-31
19,1992,Aug Average,28.012699,22.790199,1992-08-31
20,1992,Sep Average,24.312099,23.895800,1992-09-30
...,...,...,...,...,...
307,2016,Aug Average,25.534201,24.094200,2016-08-31
308,2016,Sep Average,15.142800,24.437000,2016-09-30
309,2016,Oct Average,40.005501,26.031700,2016-10-31
310,2016,Nov Average,121.997002,25.569201,2016-11-30


In [44]:
final_all_df = pd.concat([final_df, exp_df], axis=1)
final_all_df

Unnamed: 0,Year,Month Average,Rainfall - (MM),Temperature - (Celsius),dates,Age,Gender,Income (USD),Duration of Stay,Num. of People,Accommodation,Travel Expenditure (USD),Destination
16,1992.0,May Average,80.600098,24.785200,1992-05-31,25.0,Male,35000.0,3.0,1.0,Hostel,1000.0,Nakuru
17,1992.0,Jun Average,30.084200,24.056299,1992-06-30,30.0,Female,50000.0,4.0,1.0,other,1200.0,Naivasha
18,1992.0,Jul Average,35.107899,22.837700,1992-07-31,44.0,Male,95000.0,6.0,2.0,Hotel,3500.0,Mombasa
19,1992.0,Aug Average,28.012699,22.790199,1992-08-31,37.0,Female,80000.0,8.0,1.0,Hostel,3000.0,Taita-Taveta
20,1992.0,Sep Average,24.312099,23.895800,1992-09-30,26.0,Male,55000.0,7.0,3.0,Motel,2500.0,Kwale
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,,,,,NaT,39.0,Female,85000.0,10.0,1.0,Hostel,3500.0,Taita-Taveta
12,,,,,NaT,28.0,Male,45000.0,5.0,1.0,Hotel,2000.0,Kajiado
13,,,,,NaT,33.0,Female,60000.0,7.0,2.0,Motel,2500.0,Kisumu
14,,,,,NaT,48.0,Male,110000.0,14.0,3.0,other,6000.0,Malindi
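
The `NaN` and `NaT` rows above come from index alignment. `pd.concat(..., axis=1)` matches rows by index label, not by position. After the drop, `final_df` is labeled 16 onward, while `exp_df` still starts at 0, so only the overlapping labels line up: the climate row for May 1992 (label 16) is paired with expenditure row 16 rather than expenditure row 0. Whether that pairing is intended is not stated in the notebook. The toy frames below are hypothetical and only illustrate the two behaviors.

```python
import pandas as pd

# Two small frames whose index labels only partly overlap (label 2 is shared).
left = pd.DataFrame({"rain": [1.0, 2.0, 3.0]}, index=[2, 3, 4])
right = pd.DataFrame({"spend": [10, 20, 30]}, index=[0, 1, 2])

# Label-based alignment: the union of labels is kept, gaps become NaN.
aligned = pd.concat([left, right], axis=1)

# Position-based pairing: reset both indexes first so row i meets row i.
positional = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)

print(aligned)
print(positional)
```

With label alignment, a later `dropna()` keeps only the rows whose labels exist in both frames, which is consistent with the 280 rows that remain further down.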


## Drop any row with Null values

In [45]:
final_all_df = final_all_df.dropna()
final_all_df

Unnamed: 0,Year,Month Average,Rainfall - (MM),Temperature - (Celsius),dates,Age,Gender,Income (USD),Duration of Stay,Num. of People,Accommodation,Travel Expenditure (USD),Destination
16,1992.0,May Average,80.600098,24.785200,1992-05-31,25.0,Male,35000.0,3.0,1.0,Hostel,1000.0,Nakuru
17,1992.0,Jun Average,30.084200,24.056299,1992-06-30,30.0,Female,50000.0,4.0,1.0,other,1200.0,Naivasha
18,1992.0,Jul Average,35.107899,22.837700,1992-07-31,44.0,Male,95000.0,6.0,2.0,Hotel,3500.0,Mombasa
19,1992.0,Aug Average,28.012699,22.790199,1992-08-31,37.0,Female,80000.0,8.0,1.0,Hostel,3000.0,Taita-Taveta
20,1992.0,Sep Average,24.312099,23.895800,1992-09-30,26.0,Male,55000.0,7.0,3.0,Motel,2500.0,Kwale
...,...,...,...,...,...,...,...,...,...,...,...,...,...
291,2015.0,Apr Average,136.337997,26.070601,2015-04-30,27.0,Male,30000.0,4.0,2.0,Hotel,80000.0,Mombasa
292,2015.0,May Average,101.457001,25.066299,2015-05-31,45.0,Female,50000.0,7.0,4.0,Villa,150000.0,Lamu
293,2015.0,Jun Average,70.928398,24.588400,2015-06-30,36.0,Male,70000.0,5.0,3.0,Hotel,100000.0,Nairobi
294,2015.0,Jul Average,20.793100,24.462200,2015-07-31,29.0,Female,45000.0,3.0,1.0,Airbnb,50000.0,Kisumu


In [46]:
month_dummies = pd.get_dummies(final_all_df["Month Average"])
month_dummies

Unnamed: 0,Apr Average,Aug Average,Dec Average,Feb Average,Jan Average,Jul Average,Jun Average,Mar Average,May Average,Nov Average,Oct Average,Sep Average
16,0,0,0,0,0,0,0,0,1,0,0,0
17,0,0,0,0,0,0,1,0,0,0,0,0
18,0,0,0,0,0,1,0,0,0,0,0,0
19,0,1,0,0,0,0,0,0,0,0,0,0
20,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
291,1,0,0,0,0,0,0,0,0,0,0,0
292,0,0,0,0,0,0,0,0,1,0,0,0
293,0,0,0,0,0,0,1,0,0,0,0,0
294,0,0,0,0,0,1,0,0,0,0,0,0


In [47]:
final_all_df = final_all_df.join(month_dummies)
final_all_df

Unnamed: 0,Year,Month Average,Rainfall - (MM),Temperature - (Celsius),dates,Age,Gender,Income (USD),Duration of Stay,Num. of People,...,Dec Average,Feb Average,Jan Average,Jul Average,Jun Average,Mar Average,May Average,Nov Average,Oct Average,Sep Average
16,1992.0,May Average,80.600098,24.785200,1992-05-31,25.0,Male,35000.0,3.0,1.0,...,0,0,0,0,0,0,1,0,0,0
17,1992.0,Jun Average,30.084200,24.056299,1992-06-30,30.0,Female,50000.0,4.0,1.0,...,0,0,0,0,1,0,0,0,0,0
18,1992.0,Jul Average,35.107899,22.837700,1992-07-31,44.0,Male,95000.0,6.0,2.0,...,0,0,0,1,0,0,0,0,0,0
19,1992.0,Aug Average,28.012699,22.790199,1992-08-31,37.0,Female,80000.0,8.0,1.0,...,0,0,0,0,0,0,0,0,0,0
20,1992.0,Sep Average,24.312099,23.895800,1992-09-30,26.0,Male,55000.0,7.0,3.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
291,2015.0,Apr Average,136.337997,26.070601,2015-04-30,27.0,Male,30000.0,4.0,2.0,...,0,0,0,0,0,0,0,0,0,0
292,2015.0,May Average,101.457001,25.066299,2015-05-31,45.0,Female,50000.0,7.0,4.0,...,0,0,0,0,0,0,1,0,0,0
293,2015.0,Jun Average,70.928398,24.588400,2015-06-30,36.0,Male,70000.0,5.0,3.0,...,0,0,0,0,1,0,0,0,0,0
294,2015.0,Jul Average,20.793100,24.462200,2015-07-31,29.0,Female,45000.0,3.0,1.0,...,0,0,0,1,0,0,0,0,0,0


In [48]:
final_all_df = final_all_df.set_index("dates")
final_all_df

dates,Year,Month Average,Rainfall - (MM),Temperature - (Celsius),Age,Gender,Income (USD),Duration of Stay,Num. of People,Accommodation,...,Dec Average,Feb Average,Jan Average,Jul Average,Jun Average,Mar Average,May Average,Nov Average,Oct Average,Sep Average
1992-05-31,1992.0,May Average,80.600098,24.785200,25.0,Male,35000.0,3.0,1.0,Hostel,...,0,0,0,0,0,0,1,0,0,0
1992-06-30,1992.0,Jun Average,30.084200,24.056299,30.0,Female,50000.0,4.0,1.0,other,...,0,0,0,0,1,0,0,0,0,0
1992-07-31,1992.0,Jul Average,35.107899,22.837700,44.0,Male,95000.0,6.0,2.0,Hotel,...,0,0,0,1,0,0,0,0,0,0
1992-08-31,1992.0,Aug Average,28.012699,22.790199,37.0,Female,80000.0,8.0,1.0,Hostel,...,0,0,0,0,0,0,0,0,0,0
1992-09-30,1992.0,Sep Average,24.312099,23.895800,26.0,Male,55000.0,7.0,3.0,Motel,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-04-30,2015.0,Apr Average,136.337997,26.070601,27.0,Male,30000.0,4.0,2.0,Hotel,...,0,0,0,0,0,0,0,0,0,0
2015-05-31,2015.0,May Average,101.457001,25.066299,45.0,Female,50000.0,7.0,4.0,Villa,...,0,0,0,0,0,0,1,0,0,0
2015-06-30,2015.0,Jun Average,70.928398,24.588400,36.0,Male,70000.0,5.0,3.0,Hotel,...,0,0,0,0,1,0,0,0,0,0
2015-07-31,2015.0,Jul Average,20.793100,24.462200,29.0,Female,45000.0,3.0,1.0,Airbnb,...,0,0,0,1,0,0,0,0,0,0


In [49]:
high_months = [7, 8, 12]
final_all_df.loc[final_all_df.index.month.isin(high_months), 'Travel Expenditure (USD)'] *= 1.5
final_all_df

dates,Year,Month Average,Rainfall - (MM),Temperature - (Celsius),Age,Gender,Income (USD),Duration of Stay,Num. of People,Accommodation,...,Dec Average,Feb Average,Jan Average,Jul Average,Jun Average,Mar Average,May Average,Nov Average,Oct Average,Sep Average
1992-05-31,1992.0,May Average,80.600098,24.785200,25.0,Male,35000.0,3.0,1.0,Hostel,...,0,0,0,0,0,0,1,0,0,0
1992-06-30,1992.0,Jun Average,30.084200,24.056299,30.0,Female,50000.0,4.0,1.0,other,...,0,0,0,0,1,0,0,0,0,0
1992-07-31,1992.0,Jul Average,35.107899,22.837700,44.0,Male,95000.0,6.0,2.0,Hotel,...,0,0,0,1,0,0,0,0,0,0
1992-08-31,1992.0,Aug Average,28.012699,22.790199,37.0,Female,80000.0,8.0,1.0,Hostel,...,0,0,0,0,0,0,0,0,0,0
1992-09-30,1992.0,Sep Average,24.312099,23.895800,26.0,Male,55000.0,7.0,3.0,Motel,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-04-30,2015.0,Apr Average,136.337997,26.070601,27.0,Male,30000.0,4.0,2.0,Hotel,...,0,0,0,0,0,0,0,0,0,0
2015-05-31,2015.0,May Average,101.457001,25.066299,45.0,Female,50000.0,7.0,4.0,Villa,...,0,0,0,0,0,0,1,0,0,0
2015-06-30,2015.0,Jun Average,70.928398,24.588400,36.0,Male,70000.0,5.0,3.0,Hotel,...,0,0,0,0,1,0,0,0,0,0
2015-07-31,2015.0,Jul Average,20.793100,24.462200,29.0,Female,45000.0,3.0,1.0,Airbnb,...,0,0,0,1,0,0,0,0,0,0


In [50]:
final_all_df['Travel Expenditure (USD)']

dates
1992-05-31      1000.0
1992-06-30      1200.0
1992-07-31      5250.0
1992-08-31      4500.0
1992-09-30      2500.0
                ...   
2015-04-30     80000.0
2015-05-31    150000.0
2015-06-30    100000.0
2015-07-31     75000.0
2015-08-31    180000.0
Name: Travel Expenditure (USD), Length: 280, dtype: float64
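
The scaling in cell 49 relies on the `DatetimeIndex` set in cell 48: `index.month.isin(high_months)` builds a boolean mask, and `.loc` with `*=` multiplies only the masked rows. The output above agrees with this, for example July 1992 goes from 3500.0 to 5250.0 and August 1992 from 3000.0 to 4500.0. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical expenditure values indexed by month-end dates.
idx = pd.to_datetime(["2015-06-30", "2015-07-31", "2015-08-31", "2015-12-31"])
spend = pd.DataFrame({"Travel Expenditure (USD)": [100.0, 100.0, 100.0, 100.0]}, index=idx)

high_months = [7, 8, 12]

# True for rows whose index month is July, August, or December.
mask = spend.index.month.isin(high_months)

# Scale only the masked rows by 1.5, leaving the others untouched.
spend.loc[mask, "Travel Expenditure (USD)"] *= 1.5

print(spend)
```

Note that this step modifies the target column in place, so rerunning the cell would apply the factor a second time.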

In [51]:
final_all_df = final_all_df.reset_index()
final_all_df

Unnamed: 0,dates,Year,Month Average,Rainfall - (MM),Temperature - (Celsius),Age,Gender,Income (USD),Duration of Stay,Num. of People,...,Dec Average,Feb Average,Jan Average,Jul Average,Jun Average,Mar Average,May Average,Nov Average,Oct Average,Sep Average
0,1992-05-31,1992.0,May Average,80.600098,24.785200,25.0,Male,35000.0,3.0,1.0,...,0,0,0,0,0,0,1,0,0,0
1,1992-06-30,1992.0,Jun Average,30.084200,24.056299,30.0,Female,50000.0,4.0,1.0,...,0,0,0,0,1,0,0,0,0,0
2,1992-07-31,1992.0,Jul Average,35.107899,22.837700,44.0,Male,95000.0,6.0,2.0,...,0,0,0,1,0,0,0,0,0,0
3,1992-08-31,1992.0,Aug Average,28.012699,22.790199,37.0,Female,80000.0,8.0,1.0,...,0,0,0,0,0,0,0,0,0,0
4,1992-09-30,1992.0,Sep Average,24.312099,23.895800,26.0,Male,55000.0,7.0,3.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
275,2015-04-30,2015.0,Apr Average,136.337997,26.070601,27.0,Male,30000.0,4.0,2.0,...,0,0,0,0,0,0,0,0,0,0
276,2015-05-31,2015.0,May Average,101.457001,25.066299,45.0,Female,50000.0,7.0,4.0,...,0,0,0,0,0,0,1,0,0,0
277,2015-06-30,2015.0,Jun Average,70.928398,24.588400,36.0,Male,70000.0,5.0,3.0,...,0,0,0,0,1,0,0,0,0,0
278,2015-07-31,2015.0,Jul Average,20.793100,24.462200,29.0,Female,45000.0,3.0,1.0,...,0,0,0,1,0,0,0,0,0,0


In [52]:
final_all_df = final_all_df.drop(["Month Average", "Jan Average", "dates"], axis="columns")
final_all_df

Unnamed: 0,Year,Rainfall - (MM),Temperature - (Celsius),Age,Gender,Income (USD),Duration of Stay,Num. of People,Accommodation,Travel Expenditure (USD),...,Aug Average,Dec Average,Feb Average,Jul Average,Jun Average,Mar Average,May Average,Nov Average,Oct Average,Sep Average
0,1992.0,80.600098,24.785200,25.0,Male,35000.0,3.0,1.0,Hostel,1000.0,...,0,0,0,0,0,0,1,0,0,0
1,1992.0,30.084200,24.056299,30.0,Female,50000.0,4.0,1.0,other,1200.0,...,0,0,0,0,1,0,0,0,0,0
2,1992.0,35.107899,22.837700,44.0,Male,95000.0,6.0,2.0,Hotel,5250.0,...,0,0,0,1,0,0,0,0,0,0
3,1992.0,28.012699,22.790199,37.0,Female,80000.0,8.0,1.0,Hostel,4500.0,...,1,0,0,0,0,0,0,0,0,0
4,1992.0,24.312099,23.895800,26.0,Male,55000.0,7.0,3.0,Motel,2500.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
275,2015.0,136.337997,26.070601,27.0,Male,30000.0,4.0,2.0,Hotel,80000.0,...,0,0,0,0,0,0,0,0,0,0
276,2015.0,101.457001,25.066299,45.0,Female,50000.0,7.0,4.0,Villa,150000.0,...,0,0,0,0,0,0,1,0,0,0
277,2015.0,70.928398,24.588400,36.0,Male,70000.0,5.0,3.0,Hotel,100000.0,...,0,0,0,0,1,0,0,0,0,0
278,2015.0,20.793100,24.462200,29.0,Female,45000.0,3.0,1.0,Airbnb,75000.0,...,0,0,0,1,0,0,0,0,0,0
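
Dropping `Jan Average` leaves eleven month indicators, so January acts as the baseline category; keeping all twelve would make the indicators sum to 1 in every row, which is perfectly collinear with the intercept of a linear model. `pd.get_dummies` can drop a baseline itself via `drop_first=True`, but it removes the first category in sorted order, which is not necessarily the one dropped by hand. The sketch below contrasts the two on a hypothetical three-month series.

```python
import pandas as pd

# Hypothetical month labels in the same style as the "Month Average" column.
months = pd.Series(["Jan Average", "Feb Average", "Mar Average", "Jan Average"])

# Manual baseline, as in the notebook: encode everything, then drop January.
manual = pd.get_dummies(months).drop(columns="Jan Average")

# Automatic baseline: drops the first category in sorted order ("Feb Average").
auto = pd.get_dummies(months, drop_first=True)

print(list(manual.columns))
print(list(auto.columns))
```

Either choice of baseline gives an equivalent linear model; only the interpretation of the coefficients changes.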


In [53]:
accomodation_dummies = pd.get_dummies(final_all_df["Accommodation"])
gender_dummies = pd.get_dummies(final_all_df['Gender'])
final_all_df = final_all_df.join(gender_dummies)
final_all_df

Unnamed: 0,Year,Rainfall - (MM),Temperature - (Celsius),Age,Gender,Income (USD),Duration of Stay,Num. of People,Accommodation,Travel Expenditure (USD),...,Feb Average,Jul Average,Jun Average,Mar Average,May Average,Nov Average,Oct Average,Sep Average,Female,Male
0,1992.0,80.600098,24.785200,25.0,Male,35000.0,3.0,1.0,Hostel,1000.0,...,0,0,0,0,1,0,0,0,0,1
1,1992.0,30.084200,24.056299,30.0,Female,50000.0,4.0,1.0,other,1200.0,...,0,0,1,0,0,0,0,0,1,0
2,1992.0,35.107899,22.837700,44.0,Male,95000.0,6.0,2.0,Hotel,5250.0,...,0,1,0,0,0,0,0,0,0,1
3,1992.0,28.012699,22.790199,37.0,Female,80000.0,8.0,1.0,Hostel,4500.0,...,0,0,0,0,0,0,0,0,1,0
4,1992.0,24.312099,23.895800,26.0,Male,55000.0,7.0,3.0,Motel,2500.0,...,0,0,0,0,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
275,2015.0,136.337997,26.070601,27.0,Male,30000.0,4.0,2.0,Hotel,80000.0,...,0,0,0,0,0,0,0,0,0,1
276,2015.0,101.457001,25.066299,45.0,Female,50000.0,7.0,4.0,Villa,150000.0,...,0,0,0,0,1,0,0,0,1,0
277,2015.0,70.928398,24.588400,36.0,Male,70000.0,5.0,3.0,Hotel,100000.0,...,0,0,1,0,0,0,0,0,0,1
278,2015.0,20.793100,24.462200,29.0,Female,45000.0,3.0,1.0,Airbnb,75000.0,...,0,1,0,0,0,0,0,0,1,0


In [54]:
final_all_df = final_all_df.join(accomodation_dummies)
final_all_df = final_all_df.drop(["Gender", "Female", "Accommodation", "other"], axis="columns")
final_all_df

Unnamed: 0,Year,Rainfall - (MM),Temperature - (Celsius),Age,Income (USD),Duration of Stay,Num. of People,Travel Expenditure (USD),Destination,Apr Average,...,Sep Average,Male,Airbnb,Campsite,Guest House,Hostel,Hotel,Motel,Resort,Villa
0,1992.0,80.600098,24.785200,25.0,35000.0,3.0,1.0,1000.0,Nakuru,0,...,0,1,0,0,0,1,0,0,0,0
1,1992.0,30.084200,24.056299,30.0,50000.0,4.0,1.0,1200.0,Naivasha,0,...,0,0,0,0,0,0,0,0,0,0
2,1992.0,35.107899,22.837700,44.0,95000.0,6.0,2.0,5250.0,Mombasa,0,...,0,1,0,0,0,0,1,0,0,0
3,1992.0,28.012699,22.790199,37.0,80000.0,8.0,1.0,4500.0,Taita-Taveta,0,...,0,0,0,0,0,1,0,0,0,0
4,1992.0,24.312099,23.895800,26.0,55000.0,7.0,3.0,2500.0,Kwale,0,...,1,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
275,2015.0,136.337997,26.070601,27.0,30000.0,4.0,2.0,80000.0,Mombasa,1,...,0,1,0,0,0,0,1,0,0,0
276,2015.0,101.457001,25.066299,45.0,50000.0,7.0,4.0,150000.0,Lamu,0,...,0,0,0,0,0,0,0,0,0,1
277,2015.0,70.928398,24.588400,36.0,70000.0,5.0,3.0,100000.0,Nairobi,0,...,0,1,0,0,0,0,1,0,0,0
278,2015.0,20.793100,24.462200,29.0,45000.0,3.0,1.0,75000.0,Kisumu,0,...,0,0,1,0,0,0,0,0,0,0


## Visualization

In [55]:
final_all_df

Unnamed: 0,Year,Rainfall - (MM),Temperature - (Celsius),Age,Income (USD),Duration of Stay,Num. of People,Travel Expenditure (USD),Destination,Apr Average,...,Sep Average,Male,Airbnb,Campsite,Guest House,Hostel,Hotel,Motel,Resort,Villa
0,1992.0,80.600098,24.785200,25.0,35000.0,3.0,1.0,1000.0,Nakuru,0,...,0,1,0,0,0,1,0,0,0,0
1,1992.0,30.084200,24.056299,30.0,50000.0,4.0,1.0,1200.0,Naivasha,0,...,0,0,0,0,0,0,0,0,0,0
2,1992.0,35.107899,22.837700,44.0,95000.0,6.0,2.0,5250.0,Mombasa,0,...,0,1,0,0,0,0,1,0,0,0
3,1992.0,28.012699,22.790199,37.0,80000.0,8.0,1.0,4500.0,Taita-Taveta,0,...,0,0,0,0,0,1,0,0,0,0
4,1992.0,24.312099,23.895800,26.0,55000.0,7.0,3.0,2500.0,Kwale,0,...,1,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
275,2015.0,136.337997,26.070601,27.0,30000.0,4.0,2.0,80000.0,Mombasa,1,...,0,1,0,0,0,0,1,0,0,0
276,2015.0,101.457001,25.066299,45.0,50000.0,7.0,4.0,150000.0,Lamu,0,...,0,0,0,0,0,0,0,0,0,1
277,2015.0,70.928398,24.588400,36.0,70000.0,5.0,3.0,100000.0,Nairobi,0,...,0,1,0,0,0,0,1,0,0,0
278,2015.0,20.793100,24.462200,29.0,45000.0,3.0,1.0,75000.0,Kisumu,0,...,0,0,1,0,0,0,0,0,0,0


In [56]:
final_all_df = final_all_df.drop(['Destination'], axis="columns")

In [57]:
final_all_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Year                      280 non-null    float64
 1   Rainfall - (MM)           280 non-null    float32
 2   Temperature - (Celsius)   280 non-null    float32
 3   Age                       280 non-null    float64
 4   Income (USD)              280 non-null    float64
 5   Duration of Stay          280 non-null    float64
 6   Num. of People            280 non-null    float64
 7   Travel Expenditure (USD)  280 non-null    float64
 8   Apr Average               280 non-null    uint8  
 9   Aug Average               280 non-null    uint8  
 10  Dec Average               280 non-null    uint8  
 11  Feb Average               280 non-null    uint8  
 12  Jul Average               280 non-null    uint8  
 13  Jun Average               280 non-null    uint8  
 14  Mar Averag

In [58]:
# target = 'Travel Expenditure (USD)'
# for col in final_all_df.columns.drop(target):
#     plt.scatter(final_all_df[col], final_all_df[target])
#     plt.xlabel(col)
#     plt.ylabel(target)
#     plt.show()

In [59]:
y = final_all_df["Travel Expenditure (USD)"]
x = final_all_df.drop("Travel Expenditure (USD)", axis=1)

In [60]:
from sklearn.model_selection import train_test_split

In [61]:
X_train, X_test, Y_train, Y_test= train_test_split(x,y, test_size=0.2)
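
Because `train_test_split` shuffles the rows randomly, the split above, and therefore the score reported later, will differ from run to run. Passing `random_state` fixes the shuffle. The sketch below uses toy arrays and an arbitrary seed of 42; neither comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (10 rows, 2 columns) and target vector.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same seed produce identical partitions.
first = train_test_split(X, y, test_size=0.2, random_state=42)
second = train_test_split(X, y, test_size=0.2, random_state=42)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = first
print(len(X_train_demo), len(X_test_demo))
print(np.array_equal(first[1], second[1]))
```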

In [62]:
from sklearn.linear_model import LinearRegression

model1 = LinearRegression()
model1.fit(X_train, Y_train)

In [63]:
final_all_df

Unnamed: 0,Year,Rainfall - (MM),Temperature - (Celsius),Age,Income (USD),Duration of Stay,Num. of People,Travel Expenditure (USD),Apr Average,Aug Average,...,Sep Average,Male,Airbnb,Campsite,Guest House,Hostel,Hotel,Motel,Resort,Villa
0,1992.0,80.600098,24.785200,25.0,35000.0,3.0,1.0,1000.0,0,0,...,0,1,0,0,0,1,0,0,0,0
1,1992.0,30.084200,24.056299,30.0,50000.0,4.0,1.0,1200.0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1992.0,35.107899,22.837700,44.0,95000.0,6.0,2.0,5250.0,0,0,...,0,1,0,0,0,0,1,0,0,0
3,1992.0,28.012699,22.790199,37.0,80000.0,8.0,1.0,4500.0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,1992.0,24.312099,23.895800,26.0,55000.0,7.0,3.0,2500.0,0,0,...,1,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
275,2015.0,136.337997,26.070601,27.0,30000.0,4.0,2.0,80000.0,1,0,...,0,1,0,0,0,0,1,0,0,0
276,2015.0,101.457001,25.066299,45.0,50000.0,7.0,4.0,150000.0,0,0,...,0,0,0,0,0,0,0,0,0,1
277,2015.0,70.928398,24.588400,36.0,70000.0,5.0,3.0,100000.0,0,0,...,0,1,0,0,0,0,1,0,0,0
278,2015.0,20.793100,24.462200,29.0,45000.0,3.0,1.0,75000.0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [64]:
corr=final_all_df.corr(method="pearson")
corr

Unnamed: 0,Year,Rainfall - (MM),Temperature - (Celsius),Age,Income (USD),Duration of Stay,Num. of People,Travel Expenditure (USD),Apr Average,Aug Average,...,Sep Average,Male,Airbnb,Campsite,Guest House,Hostel,Hotel,Motel,Resort,Villa
Year,1.0,0.010338,0.248235,0.006963,-0.095369,-0.364253,0.224278,0.441733,0.022173,-3.09067e-16,...,-0.022173,0.020126,0.306947,0.035692,0.222336,-0.162366,-0.095129,-0.269739,0.281025,0.175771
Rainfall - (MM),0.0103384,1.0,0.238837,-0.011395,-0.005584,0.017946,0.085008,-0.007614,0.434844,-0.1850492,...,-0.19132,0.020559,0.024641,-0.036752,-0.055694,-0.069919,0.054777,-0.018307,0.055887,-0.018382
Temperature - (Celsius),0.2482352,0.238837,1.0,-0.062317,-0.066689,-0.049014,0.098647,0.066325,0.269726,-0.3568244,...,-0.130096,0.009946,0.068305,-0.065445,-0.010092,-0.023355,-0.039029,-0.02812,0.079455,0.06572
Age,0.006962731,-0.011395,-0.062317,1.0,0.757832,0.501331,0.249745,0.316321,-0.003891,0.1006506,...,-0.076895,0.099966,0.004205,0.094778,-0.164643,-0.16558,0.124026,-0.102061,0.099502,0.089909
Income (USD),-0.09536935,-0.005584,-0.066689,0.757832,1.0,0.525722,0.224891,0.514348,-0.005238,0.06883076,...,-0.063469,0.062184,0.024569,0.048301,-0.198564,-0.159818,0.150012,-0.122133,0.097164,0.188
Duration of Stay,-0.3642533,0.017946,-0.049014,0.501331,0.525722,1.0,0.32775,0.188549,-0.026104,0.02516048,...,-0.032958,0.088321,-0.068965,-0.04429,-0.246051,-0.039128,0.104895,0.095666,-0.078686,0.06848
Num. of People,0.2242776,0.085008,0.098647,0.249745,0.224891,0.32775,1.0,0.259252,-0.040645,-0.02930016,...,-0.028627,0.24485,0.143749,-0.01152,-0.105808,-0.113685,-0.118213,0.003742,0.157867,0.235428
Travel Expenditure (USD),0.4417334,-0.007614,0.066325,0.316321,0.514348,0.188549,0.259252,1.0,-0.036225,0.01857,...,-0.054946,0.021597,0.174746,-0.005698,-0.066159,-0.228216,0.017523,-0.173316,0.184115,0.49322
Apr Average,0.02217316,0.434844,0.269726,-0.003891,-0.005238,-0.026104,-0.040645,-0.036225,1.0,-0.09159737,...,-0.089494,0.030485,-0.087357,-0.036014,-0.02538,-0.113264,0.164888,-0.034413,0.074435,-0.060495
Aug Average,-3.09067e-16,-0.185049,-0.356824,0.100651,0.068831,0.02516,-0.0293,0.01857,-0.091597,1.0,...,-0.091597,-0.059804,-0.08941,-0.03686,0.17983,-0.052661,-0.062049,0.038576,0.018841,-0.061916
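
The full correlation matrix is hard to scan. For modeling, the column for the target is usually the most relevant part; in the visible output above, `Income (USD)` (about 0.51), `Villa` (about 0.49), and `Year` (about 0.44) show the strongest linear association with `Travel Expenditure (USD)`. The sketch below ranks features by absolute correlation with a target on a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical data: "a" tracks the target exactly, "b" moves against it, "c" is weak.
df = pd.DataFrame(
    {
        "target": [1, 2, 3, 4],
        "a": [2, 4, 6, 8],
        "b": [4, 3, 1, 2],
        "c": [1, 0, 1, 0],
    }
)

# Take the target's column of the Pearson matrix, drop its self-correlation,
# and sort by magnitude so strong negative relationships are not buried.
ranked = (
    df.corr(method="pearson")["target"]
    .drop("target")
    .sort_values(key=abs, ascending=False)
)

print(ranked)
```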


In [65]:
from sklearn.metrics import r2_score
Y_pred = model1.predict(X_test)
# r2_score returns the coefficient of determination (R^2), not a classification accuracy
accuracy = r2_score(Y_test, Y_pred)
accuracy

0.5445637874585609
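
The value above is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$, so roughly 0.54 means the model explains about 54% of the variance in expenditure on this particular test split. The sketch below computes the same quantity by hand with NumPy; `r2_manual` is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


perfect = r2_manual([1, 2, 3], [1, 2, 3])    # predictions match exactly
mean_only = r2_manual([1, 2, 3], [2, 2, 2])  # always predicting the mean

print(perfect, mean_only)
```

A model that always predicts the mean scores 0, and a model that does worse than that scores below 0, which is one reason the name `accuracy` can be misleading for this metric.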

In [66]:
import pickle
with open('expenditure2.pkl', 'wb') as file:
    pickle.dump(model1, file)
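
To use the saved model later, the file is opened in `'rb'` mode and passed to `pickle.load`. The sketch below shows the round trip on a toy regression standing in for `model1`; the file name and data are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in for model1: fit y = 2*x + 1 on a single hypothetical feature.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0
model = LinearRegression().fit(X, y)

# Save and reload, mirroring the 'wb' / 'rb' pairing used with pickle.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as file:
    pickle.dump(model, file)
with open(path, "rb") as file:
    restored = pickle.load(file)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

When loading `expenditure2.pkl`, new inputs need the same feature columns in the same order as `X_train`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.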