<h1  align ='center'> Predicting the Price of an Uber Trip </h1>

<h3 align = 'center'> Version 14  </h3>

## Introduction
- This project predicts the price of an Uber trip. As usual, the dataset is noisy and needs a lot of preprocessing and feature engineering.
- Let's start working on the dataset in the notebook. The first step is to import the libraries and load the data. After that, we build a basic understanding of the data: its shape, a sample of rows, and whether it contains any NULL values. Understanding the data is an important step in any prediction or machine learning project.

## Project Agenda
- 1- Introduction
- 2- Load the Data
- 3- Data Assessing
- 4- Data Cleaning
- 5- Perform Exploratory Data Analysis
- 6- Feature Engineering
- 7- Model Selection
- 8- Testing the Selected Model


## 2- Loading the Data

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import seaborn as sns
import pandas as pd
import plotly.express as px
from sklearn import ensemble
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
uber_df = pd.read_csv(r"../input/historicaluberdata/UberDataSet.csv")

## 3- Data Assessing 
__________________________________________________
- Data assessing is the step in which we evaluate our data.
### In this step:
- We take a quick look inside our data and the columns.
- Check if there are any missing values in the data so that we can handle them.
- Check if there are duplicated values.


In [None]:
uber_df.head(5)

In [None]:
uber_df.info()

In [None]:
uber_df.isnull().sum()

In [None]:
uber_df.columns

In [None]:
uber_df['cab_type'].value_counts()

In [None]:
uber_df.product_id.value_counts()

In [None]:
uber_df['name'].value_counts()

In [None]:
uber_df.shape

In [None]:
uber_df.duplicated().sum()

## 4- Data Cleaning
- Drop the columns we don't need, such as 'Unnamed: 0', 'datetime', 'id', 'product_id', 'long_summary', 'short_summary', and 'timestamp'.
- Handle the missing values in the price column by dropping the rows where they occur.
- Rename the icon and name columns to weather_condition and service_type.


In [None]:
uber_df.dropna(inplace=True)
# The data has a single timezone (America/New_York) and a single cab_type (Uber),
# so both columns will be dropped in the next cell.
uber_df.timezone.value_counts()
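
Checking `value_counts()` column by column works, but single-valued columns can also be found in one pass. The sketch below is only an illustration: `constant_columns` is a hypothetical helper, and the toy frame stands in for `uber_df` with made-up values.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    # dropna=False counts NaN as a value, so an all-NaN column is also reported
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy frame standing in for uber_df (values are made up for illustration)
toy = pd.DataFrame({
    "timezone": ["America/New_York"] * 3,
    "cab_type": ["Uber"] * 3,
    "price": [8.5, 12.0, 27.5],
})

single_valued = constant_columns(toy)
print(single_valued)
```

Columns reported this way carry no information for a model, which is why timezone and cab_type can be dropped.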

In [None]:
# Remove the columns we do not need: location fields, identifiers, text summaries, timestamps, timezone, and cab_type
uber_df=uber_df.reset_index()
uber_df = uber_df.drop(['latitude', 'longitude','source', 'destination','datetime','Unnamed: 0','index','id','product_id','long_summary','short_summary','timestamp','timezone','cab_type'], axis=1 )
uber_df.info()

In [None]:
uber_df.shape

In [None]:
# Rename the name and icon columns to service_type and weather_condition
uber_df.rename(columns={'name': 'service_type','icon':'weather_condition'}, inplace=True)
uber_df.head(2)

In [None]:
uber_df.info()

In [None]:
# index=False avoids writing the row index as an extra unnamed column
uber_df.to_csv(r'CleanAndFilteredData.csv', index=False)


## 5- Perform EDA (Exploratory Data Analysis)
- Exploratory analysis is the process of exploring the data and the relationships within it in depth, so that the feature engineering and machine learning modeling steps that follow are smooth and streamlined.



In [None]:
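# distplot is deprecated in recent seaborn versions; histplot with kde=True draws a comparable plot
sns.histplot(uber_df['price'], kde=True)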

In [None]:
sns.barplot(x='service_type', y='price', data=uber_df)

In [None]:
uber_df['month'].value_counts().plot(kind='bar', figsize=(6,5), color=['#002080','#ff0066'],title = "Number of trips per Month ")

#### Great, we have data for two months: November and December

In [None]:
pie_df = uber_df.weather_condition.value_counts().reset_index()
pie_df.columns = ['condition', 'count']
# pie_df.head()
fig = px.pie(pie_df, values='count', names='condition', title='The proportion of number of trips in each weather condition', color_discrete_sequence=['#003f5c','#ffa600','#bc5090'], hole=0.2)
fig.show()

## 6- Feature Engineering
-----------------------------------------------------

What is a feature, and why do we need to engineer it? Basically, all machine learning algorithms use some input data to create outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly. This is where the need for feature engineering arises.

I think feature engineering efforts mainly have two goals:

1) Preparing the proper input dataset, compatible with the machine learning algorithm requirements.

2) Improving the performance of machine learning models.

### 6.1- Encoding with pandas get_dummies

In [None]:
uber_df.head(1)

In [None]:
uber_df.service_type.value_counts()

In [None]:
uber_df.shape

In [None]:
uber_df = pd.get_dummies(uber_df,drop_first=False)
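
To make the effect of the encoding step concrete, here is a minimal sketch on a toy frame. The column values are made up, and the `drop_first=True` variant is shown only for comparison; the notebook itself uses `drop_first=False`.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (made-up values)
toy = pd.DataFrame({
    "service_type": ["UberX", "UberPool", "UberX"],
    "distance": [1.2, 0.8, 3.1],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, drop_first=False)
print(encoded.columns.tolist())

# drop_first=True removes the first level of each categorical column
encoded_dropped = pd.get_dummies(toy, drop_first=True)
print(encoded_dropped.columns.tolist())
```

With `drop_first=False`, the indicator columns of one categorical variable always sum to 1, so they are perfectly collinear with an intercept. That can matter for linear models and is one reason to consider `drop_first=True`.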

In [None]:
uber_df.head()

In [None]:
uber_df.shape

### 6.2- RFE (Recursive Feature Elimination) and R-squared
--------------------------------------
- Recursive Feature Elimination, or RFE for short, is a popular feature selection algorithm.

- RFE is popular because it is easy to configure and use, and because it is effective at selecting the features (columns) in a training dataset that are most relevant for predicting the target variable.

- The topic of RFE can be divided into three parts:

    - Recursive Feature Elimination
    - RFE with scikit-learn
        - RFE for Classification
        - RFE for Regression
    - RFE Hyperparameters
        - Explore Number of Features
        - Automatically Select the Number of Features
        - Which Features Were Selected
        - Explore Base Algorithm
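
The elimination idea can be sketched on synthetic data where the informative columns are known in advance. Everything in this example (the toy arrays, the coefficients, the noise level) is an assumption made for illustration and is unrelated to the Uber dataset.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
# Only columns 0 and 3 drive the target; the other four columns are pure noise
y_toy = 5.0 * X_toy[:, 0] - 3.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

# RFE repeatedly fits the estimator and drops the weakest feature
# (smallest coefficient magnitude for a linear model) until 2 remain
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X_toy, y_toy)

print(selector.support_)   # boolean mask of the kept columns
print(selector.ranking_)   # 1 marks a selected feature; larger means dropped earlier
```

Because `LinearRegression` ranks features by coefficient size, features measured on very different scales may be ranked unfairly, so scaling the inputs first is worth considering.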
   
   
   
- R-squared (R2) is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by the independent variable or variables in a regression model. Whereas correlation describes the strength of the relationship between an independent and a dependent variable, R-squared describes to what extent the variance of one variable explains the variance of the other. So, if the R2 of a model is 0.50, approximately half of the observed variation can be explained by the model's inputs.
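
As a small worked example of the definition, R2 can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compared with scikit-learn's `r2_score`. The four target values and predictions below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual values and predictions
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_hat = np.array([11.0, 12.0, 14.0, 19.0])

# Residual sum of squares: variation the model fails to explain
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: variation around the mean of the actual values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```

This is the same quantity that the `score` method of a scikit-learn regressor returns.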

### 6.2.1- Load the necessary libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

### 6.2.2- Define the dependent variable 'y', the independent variables 'X', and the `R_sqaured` dictionary

In [None]:
# Define Dependent Variable 'y'and independent variable 'X'
X = uber_df.drop('price', axis= 1)
y = uber_df['price']
print(X.shape)
print(y.shape)
R_sqaured = {}
Mean_SE = {}

### RFE (Recursive Feature Elimination) Function

In [None]:
# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

def RFE_Function(i):
    """Select i features with RFE, refit a linear model on them, and record the test R squared and MSE."""
    reg = LinearRegression()
    rfe = RFE(reg , n_features_to_select = i)
    rfe = rfe.fit(X_train, y_train)
    # Keep only the training columns that RFE marked as selected
    selected_columns = X_train[X_train.columns[rfe.support_]]
    # Fitting training data
    reg = reg.fit(selected_columns, y_train)
    # Score on the held-out test set using the same columns
    R_sqaured['Feature_'+str(i)]=reg.score(X_test[selected_columns.columns], y_test)
    y_pred = reg.predict(X_test[selected_columns.columns])
    Mean_SE['Feature_'+str(i)] = metrics.mean_squared_error(y_test,y_pred)
    return selected_columns

#### Model 0: with all features

In [None]:
X_55 = RFE_Function(55)
print("The R squared of 55 features is ",R_sqaured['Feature_55'])
print("The MSE of 55 features is ",Mean_SE['Feature_55'])
print(X_55.shape)

#### Model 1: with 48 features 


In [None]:
X_48 = RFE_Function(48)
print("The R squared of 48 features is ",R_sqaured['Feature_48'])
print("The MSE of 48 features is ",Mean_SE['Feature_48'])
print(X_48.shape)

#### Model 2: with 28 features using RFE

In [None]:
X_28 = RFE_Function(28)
print("The R squared of 28 features is ",R_sqaured['Feature_28'])
print("The MSE of 28 features is ",Mean_SE['Feature_28'])
print(X_28.shape)

#### Model 3: with 18 features using RFE

In [None]:
X_18 = RFE_Function(18)
print("The R squared of 18 features is ",R_sqaured['Feature_18'])
print("The MSE of 18 features is ",Mean_SE['Feature_18'])
print(X_18.shape)

#### Model 4: with 8 features using RFE

In [None]:
X_8 = RFE_Function(8)
print("The R squared of 8 features is ",R_sqaured['Feature_8'])
print("The MSE of 8 features is ",Mean_SE['Feature_8'])
print(X_8.shape)

#### Model 5: with 5 features using RFE

In [None]:
X_5 = RFE_Function(5)
print("The R squared of 5 features is ",R_sqaured['Feature_5'])
print("The MSE of 5 features is ",Mean_SE['Feature_5'])
print(X_5.shape)

In [None]:
R_sqaured

In [None]:
Mean_SE
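
Two separate dictionaries are harder to compare than one table. The sketch below shows one possible way to combine them; `summarize_scores` is a hypothetical helper, and the numbers are placeholders chosen for illustration, not results from this notebook.

```python
import pandas as pd

def summarize_scores(r2_by_model: dict, mse_by_model: dict) -> pd.DataFrame:
    """Combine two {model_name: score} dictionaries into one table sorted by R squared."""
    summary = pd.DataFrame({"r_squared": r2_by_model, "mse": mse_by_model})
    summary.index.name = "model"
    return summary.sort_values("r_squared", ascending=False)

# Placeholder scores (NOT the notebook's actual results)
placeholder_r2 = {"Feature_5": 0.8, "Feature_8": 0.9}
placeholder_mse = {"Feature_5": 14.0, "Feature_8": 6.0}

summary = summarize_scores(placeholder_r2, placeholder_mse)
print(summary)
```

In the notebook, the call would presumably be `summarize_scores(R_sqaured, Mean_SE)`.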

#### Choosing the number of features
* The R squared values are all close to 91%, except for RFE with 5 features, which gives about 80%, so I will train with 8 features. The MSE tells the same story: the values are all close to 5.8, except for RFE with 5 features, which gives 14.2.

* The differences make it clear that nothing is influential other than the type of car and the distance, so we will go with X_8.

In [None]:
X_8.columns

In [None]:
# Reassign instead of dropping in place, since X_8 was derived from X_train
X_8 = X_8.drop(columns=['temperatureMax'])
X_8.columns

In [None]:
X_train = X_8
X_test = X_test[X_8.columns]
print(X_train.shape)
print(X_test.shape)

## 7- Model Selection

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

#### 7.1 Linear Regression Model 

In [None]:
linear = LinearRegression()
linear.fit(X_train, y_train)
print(linear.score(X_test, y_test))
y_pred = linear.predict(X_test)
print('MSE :'," ", metrics.mean_squared_error(y_test,y_pred))
print('RMSE :'," ", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
sns.kdeplot(y_pred, color="green", label='Predicted values') # Linear Regression Model prediction
sns.kdeplot(y_test, color="red", label='Actual values')
plt.title("Relation between Predicted and Actual value (Linear Regression Model)")
plt.legend()
plt.show()

#### 7.2 Random Forest Regressor

In [None]:
random = RandomForestRegressor(n_estimators = 100, random_state = 0) 
random.fit(X_train, y_train) 
print(random.score(X_test, y_test))
y_pred = random.predict(X_test)
print('MSE :'," ", metrics.mean_squared_error(y_test,y_pred))
print('RMSE :'," ", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

In [None]:
sns.kdeplot(y_pred, color="green", label='Predicted values') # Random Forest prediction
sns.kdeplot(y_test, color="red", label='Actual values')
plt.title("Relation between Predicted and Actual value (Random Forest Model)")
plt.legend()
plt.show()

#### Random Forest gave us an R squared of about 95%, so we will go with it

## 8- Testing
* Cross-validation with ShuffleSplit (5 random splits)
* Testing the Random Forest Regressor

In [None]:
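from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
# 5 random 80/20 splits; scores are negative MSE, so values closer to 0 are better
cv=ShuffleSplit(n_splits=5,test_size=0.2,random_state=0)
# Cross-validate the selected model (Random Forest) rather than LinearRegression
mse = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0),X_test,y_test,cv=cv , scoring='neg_mean_squared_error')
print(mse)
print('Mean of all folds is',mse.mean() )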
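
The scores above are negative mean squared errors, which are awkward to read. The sketch below shows how they could be turned into a per-fold RMSE. It runs on synthetic data from `make_regression`, and it uses `KFold` with a smaller forest purely as assumptions for the illustration: `KFold` partitions the data into disjoint folds, whereas `ShuffleSplit` draws independent random splits that may overlap.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the notebook's features and prices
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Five disjoint folds, shuffled once with a fixed seed
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(
    RandomForestRegressor(n_estimators=20, random_state=0),
    X_toy, y_toy, cv=cv, scoring="neg_mean_squared_error",
)

# Flip the sign to recover MSE, then take the square root for RMSE
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```

RMSE is in the same units as the target, so on the real data it would read directly as a typical price error.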

In [None]:
sns.kdeplot(y_pred, color="green", label='Predicted values') 
sns.kdeplot(y_test, color="red", label='Actual values')
plt.title("Relation between Predicted and Actual value (Random Forest Model)")
plt.legend()
plt.show()