### Project description

You're working as an analyst for Zuber, a new ride-sharing company that's launching in Chicago. Your task is to find patterns in the available information. You want to understand passenger preferences and the impact of external factors on rides.

Working with a database, you'll analyze data from competitors and test a hypothesis about the impact of weather on ride frequency.

## Step 4. Exploratory data analysis (Python)

In addition to the data you retrieved in the previous tasks, you've been given a second file. You now have these two CSVs:

[/datasets/project_sql_result_01.csv](https://practicum-content.s3.us-west-1.amazonaws.com/learning-materials/data-analyst-eng/moved_project_sql_result_01.csv). It contains the following data:

_company_name_: taxi company name

_trips_amount_: the number of rides for each taxi company on November 15-16, 2017.

[/datasets/project_sql_result_04.csv](https://practicum-content.s3.us-west-1.amazonaws.com/learning-materials/data-analyst-eng/moved_project_sql_result_04.csv). It contains the following data:

_dropoff_location_name_: Chicago neighborhoods where rides ended

_average_trips_: the average number of rides that ended in each neighborhood in November 2017.

* import the libraries

In [1]:
# import libraries
import pandas as pd
from scipy import stats as st
import numpy as np
import plotly.express as px

* import the files

In [2]:
# load data sets
df_company_name = pd.read_csv('/datasets/project_sql_result_01.csv')
df_dropoff_location_name = pd.read_csv('/datasets/project_sql_result_04.csv')
df_rides_ohare = pd.read_csv('/datasets/project_sql_result_07.csv')

* study the data they contain

In [3]:

display(pd.concat([df_company_name[:10],df_dropoff_location_name[:10],df_rides_ohare[:10]],axis=1))

Unnamed: 0,company_name,trips_amount,dropoff_location_name,average_trips,start_ts,weather_conditions,duration_seconds
0,Flash Cab,19558,Loop,10727.466667,2017-11-25 16:00:00,Good,2410.0
1,Taxi Affiliation Services,11422,River North,9523.666667,2017-11-25 14:00:00,Good,1920.0
2,Medallion Leasing,10367,Streeterville,6664.666667,2017-11-25 12:00:00,Good,1543.0
3,Yellow Cab,9888,West Loop,5163.666667,2017-11-04 10:00:00,Good,2512.0
4,Taxi Affiliation Service Yellow,9299,O'Hare,2546.9,2017-11-11 07:00:00,Good,1440.0
5,Chicago Carriage Cab Corp,9181,Lake View,2420.966667,2017-11-11 04:00:00,Good,1320.0
6,City Service,8448,Grant Park,2068.533333,2017-11-04 16:00:00,Bad,2969.0
7,Sun Taxi,7701,Museum Campus,1510.0,2017-11-18 11:00:00,Good,2280.0
8,Star North Management LLC,7455,Gold Coast,1364.233333,2017-11-11 14:00:00,Good,2460.0
9,Blue Ribbon Taxi Association Inc.,5953,Sheffield & DePaul,1259.766667,2017-11-11 12:00:00,Good,2040.0


* make sure the data types are correct

In [4]:
# info() prints its report itself and returns None, so it is called on its own line
print('df_company_name')
df_company_name.info()

df_company_name
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_name  64 non-null     object
 1   trips_amount  64 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ KB


In [5]:
# info() prints its report itself and returns None, so it is called on its own line
print('df_dropoff_location_name')
df_dropoff_location_name.info()

df_dropoff_location_name
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   dropoff_location_name  94 non-null     object 
 1   average_trips          94 non-null     float64
dtypes: float64(1), object(1)
memory usage: 1.6+ KB


In [6]:
# info() prints its report itself and returns None, so it is called on its own line
print('df_rides_ohare')
df_rides_ohare.info()

df_rides_ohare
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1068 entries, 0 to 1067
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   start_ts            1068 non-null   object 
 1   weather_conditions  1068 non-null   object 
 2   duration_seconds    1068 non-null   float64
dtypes: float64(1), object(2)
memory usage: 25.2+ KB


* check for duplicates

In [7]:
print('Total duplicates  ', df_rides_ohare.duplicated().sum())

Total duplicates   197


In [21]:
# remove duplicates
df_rides_ohare = df_rides_ohare.drop_duplicates()
print('Total duplicates  ', df_rides_ohare.duplicated().sum())

Total duplicates   0
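
As a small illustration of the duplicate check above, the sketch below uses a made-up frame in the same shape as `df_rides_ohare` (the rows are hypothetical, not read from the project file). It shows what `duplicated()` counts and what `drop_duplicates()` keeps. Since `start_ts` in the displayed rows is rounded to the hour, identical rows are not necessarily one ride recorded twice, so it may be worth inspecting them with `keep=False` before dropping them.

```python
import pandas as pd

# Hypothetical sample in the same shape as df_rides_ohare (values are illustrative).
sample = pd.DataFrame({
    'start_ts': ['2017-11-04 16:00:00', '2017-11-04 16:00:00', '2017-11-11 07:00:00'],
    'weather_conditions': ['Bad', 'Bad', 'Good'],
    'duration_seconds': [2969.0, 2969.0, 1440.0],
})

# duplicated() flags the second and later copies of a fully identical row.
n_duplicates = sample.duplicated().sum()

# keep=False flags every copy, which helps when inspecting the rows by eye.
all_copies = sample[sample.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each row.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print('Total duplicates  ', n_duplicates)
print(all_copies)
print('Rows after removal', len(deduplicated))
```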


<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 2</b> 
    
Perfect!
</div>

* identify the top 10 neighborhoods in terms of drop-offs

In [22]:
df_dropoff_location_name_top10  = df_dropoff_location_name.sort_values('average_trips', ascending = False)[:10]
display(df_dropoff_location_name_top10)

Unnamed: 0,dropoff_location_name,average_trips
0,Loop,10727.466667
1,River North,9523.666667
2,Streeterville,6664.666667
3,West Loop,5163.666667
4,O'Hare,2546.9
5,Lake View,2420.966667
6,Grant Park,2068.533333
7,Museum Campus,1510.0
8,Gold Coast,1364.233333
9,Sheffield & DePaul,1259.766667


The datasets were loaded into pandas DataFrames successfully. The tables were displayed and the info() method was run on each dataset; no missing values or inconsistencies were found. The data is already sorted in descending order. The data types are consistent with the recorded information, except for the start_ts column of the airport rides, which needs to be converted to a datetime type before any date operations are applied.
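
The observations about sort order and data types can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming a small frame built from the first five drop-off rows displayed above; the variable names `top`, `is_sorted_desc` and `is_numeric` are introduced here for illustration only.

```python
import pandas as pd

# First five rows of the drop-off table shown above.
top = pd.DataFrame({
    'dropoff_location_name': ['Loop', 'River North', 'Streeterville', 'West Loop', "O'Hare"],
    'average_trips': [10727.466667, 9523.666667, 6664.666667, 5163.666667, 2546.9],
})

# True when each value is less than or equal to the one before it.
is_sorted_desc = top['average_trips'].is_monotonic_decreasing

# Numeric columns should not come back with the generic 'object' dtype.
is_numeric = pd.api.types.is_numeric_dtype(top['average_trips'])

print('sorted descending:', is_sorted_desc)
print('numeric dtype:    ', is_numeric)
```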

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
Yep, very nice!
</div>

* make graphs: taxi companies and number of rides, top 10 neighborhoods by number of dropoffs

In [27]:
# keep the 15 companies with the most rides
df_company_name_top15  = df_company_name.sort_values('trips_amount', ascending = False)[:15]

fig = px.bar(df_company_name_top15, y='company_name', x='trips_amount',
            title="Taxi companies and number of rides (top 15)")
fig.show()

In [28]:
fig = px.bar(df_dropoff_location_name_top10, y='dropoff_location_name', x='average_trips',
            title="Top 10 neighborhoods by number of drop-offs")
fig.show()

In [25]:
df_rides_ohare_gb = df_rides_ohare.groupby('weather_conditions', as_index=False)['duration_seconds'].sum()
print(df_rides_ohare_gb)

fig = px.bar(df_rides_ohare_gb, y='duration_seconds', x='weather_conditions',
            title="Total ride duration by weather condition")
fig.show()

  weather_conditions  duration_seconds
0                Bad          356566.0
1               Good         1469319.0


* draw conclusions based on each graph and explain the results

Three charts were displayed using the **plotly** library.

The first bar chart shows the number of trips for each taxi company, limited to the top 15 companies. The distribution is right-skewed: Flash Cab leads with 19,558 rides, well ahead of the rest.

The second chart shows the average number of trips for the top 10 drop-off neighborhoods. It helps identify where a trip is most likely to end.

The last chart compares the cumulative ride duration by weather condition. The total is much larger under good conditions (1,469,319 seconds) than under bad conditions (356,566 seconds).
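
A cumulative duration depends on how many rides fall into each group as well as on how long each ride is. The sketch below, which uses a made-up four-row frame (hypothetical values, not the project data), puts the count, mean and sum next to each other so the three can be read together.

```python
import pandas as pd

# Hypothetical rides (illustrative values only): three in good weather, one in bad.
rides = pd.DataFrame({
    'weather_conditions': ['Good', 'Good', 'Good', 'Bad'],
    'duration_seconds': [1800.0, 2000.0, 2200.0, 2600.0],
})

# Count, mean and sum side by side for each weather condition.
summary = rides.groupby('weather_conditions')['duration_seconds'].agg(['count', 'mean', 'sum'])
print(summary)
```

In this toy sample the 'Good' group has the larger total even though its average ride is shorter, so a total on its own does not show which condition has longer rides.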


<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
I can also recommend this site; you may find something interesting there for your visualization skills:

https://www.python-graph-gallery.com/
</div>



## Step 5. Testing hypotheses (Python)

Is the average ride duration in rainy (bad) weather different from the average ride duration in good weather?

In [12]:
df_rides_ohare['start_ts'] = pd.to_datetime(df_rides_ohare['start_ts'])

In [13]:
df_rides_ohare['dayweek'] = df_rides_ohare['start_ts'].dt.weekday

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
You also did not forget about the conversion of data types in dates</div>
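
The weekday filter in the next cell relies on how pandas numbers the days. As a quick sketch using two timestamps from the table above, `dt.weekday` counts from Monday as 0, so Saturday is 5; `dt.day_name()` is shown only as a readable cross-check.

```python
import pandas as pd

# Two timestamps taken from the table above, both Saturdays in November 2017.
ts = pd.to_datetime(pd.Series(['2017-11-04 16:00:00', '2017-11-25 14:00:00']))

# dt.weekday counts from Monday=0, so Saturday is 5 and Sunday is 6.
weekdays = ts.dt.weekday
day_names = ts.dt.day_name()

print(weekdays.tolist())   # [5, 5]
print(day_names.tolist())  # ['Saturday', 'Saturday']
```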

In [29]:
ride_avgduration_satuday_rain = df_rides_ohare[(df_rides_ohare['dayweek'] == 5) & (df_rides_ohare['weather_conditions'] == 'Bad')]

In [31]:

display(ride_avgduration_satuday_rain[:5])

Unnamed: 0,start_ts,weather_conditions,duration_seconds,dayweek
6,2017-11-04 16:00:00,Bad,2969.0,5
30,2017-11-18 12:00:00,Bad,1980.0,5
34,2017-11-04 17:00:00,Bad,2460.0,5
51,2017-11-04 16:00:00,Bad,2760.0,5
52,2017-11-18 12:00:00,Bad,2460.0,5


In [32]:
ride_avgduration_all = df_rides_ohare[df_rides_ohare['weather_conditions'] == 'Good']

In [34]:
display(ride_avgduration_all[:5])

Unnamed: 0,start_ts,weather_conditions,duration_seconds,dayweek
0,2017-11-25 16:00:00,Good,2410.0,5
1,2017-11-25 14:00:00,Good,1920.0,5
2,2017-11-25 12:00:00,Good,1543.0,5
3,2017-11-04 10:00:00,Good,2512.0,5
4,2017-11-11 07:00:00,Good,1440.0,5


<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 2</b> 
    
Two dataframes were isolated
</div>

Test plan

Test whether the means of two independent samples are significantly different.

Assumptions:

* observations in each sample are independent and identically distributed (iid)
* observations in each sample are normally distributed
* observations in each sample have the same variance

Interpretation

H0: the means of the samples are equal.

H1: the means of the samples are unequal.

st.ttest_ind was selected because it tests whether the means of two independent samples are equal.
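
The equal-variance assumption listed above can be checked before running the t-test. The sketch below is one possible approach, not the method used in this project: it runs Levene's test (`st.levene`) on synthetic samples and passes the result to the `equal_var` parameter of `st.ttest_ind`. The samples, the seed and the names `good` and `bad` are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats as st

# Synthetic samples for illustration only (not the project data).
rng = np.random.default_rng(0)
good = rng.normal(loc=2000, scale=700, size=200)
bad = rng.normal(loc=2400, scale=700, size=60)

# Levene's test: H0 is that both samples have equal variances.
levene_result = st.levene(good, bad)

# Use the pooled-variance t-test only if equal variances are plausible;
# otherwise fall back to Welch's t-test (equal_var=False).
alpha = 0.05
equal_var = levene_result.pvalue >= alpha
ttest_result = st.ttest_ind(good, bad, equal_var=equal_var)

print('Levene p-value:', levene_result.pvalue)
print('equal_var used:', equal_var)
print('t-test p-value:', ttest_result.pvalue)
```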

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 2</b> 
    
And hypotheses were formulated
</div>

In [35]:
alpha = 0.05  # statistical significance level
# if the p-value is less than alpha, we reject the null hypothesis

results = st.ttest_ind(ride_avgduration_all['duration_seconds'], ride_avgduration_satuday_rain['duration_seconds'])

print('p-value: ', results.pvalue)

if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis") 

p-value:  7.397770692813604e-08
We reject the null hypothesis


The null hypothesis was rejected: the mean ride durations of the two samples are not equal.
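
Rejecting the null hypothesis says that the means differ, but not in which direction or by how much. A short sketch of how that could be read off, assuming two small hypothetical samples of durations in seconds (illustrative values, not the full project data):

```python
import pandas as pd

# Hypothetical duration samples in seconds (illustrative values only).
good_weather = pd.Series([1440.0, 1920.0, 2040.0, 2280.0])
bad_weather = pd.Series([1980.0, 2460.0, 2760.0, 2969.0])

mean_good = good_weather.mean()
mean_bad = bad_weather.mean()

# A positive difference means bad-weather rides are longer on average.
difference = mean_bad - mean_good

print('mean good:', mean_good)
print('mean bad: ', mean_bad)
print('difference (s):', difference)
```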

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 2</b> 
    
Absolutely true!
</div>


The analysis showed which taxi companies lead the market and how their ride counts differ over the period studied.

The top 10 neighborhoods by number of drop-offs show which destinations are in demand most often.

Ride duration was compared across weather conditions. The chart suggests that good weather conditions occur more often.

The null hypothesis was rejected: the average ride duration on Saturdays with bad weather is not equal to the average ride duration in good weather.


<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 2</b> 

And the conclusion is formulated absolutely correctly throughout the work, great!
</div>

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>Review summary</b> 
    
Miguel, the project is great! You have very strong analytical skills, knowledge of research tools and understanding of statistical methods. But still there are a few comments in the project and I will ask you to correct them so that your project becomes even better!
</div>