**What is Data Analysis?**

Data Analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and 
support decision-making. It is a crucial step in any data-driven approach, helping organizations and individuals make informed decisions by interpreting 
data patterns, trends, and insights.

**Steps in Data Analysis:**

**Data Collection:** Gathering raw data from various sources such as databases, APIs, surveys, or logs.

**Data Cleaning:** Removing or correcting inaccuracies, duplicates, and inconsistencies in the data.

**Exploratory Data Analysis (EDA):** Summarizing the main characteristics of the data using statistical methods and visualization tools.

**Data Transformation:** Preparing the data for analysis by normalizing, aggregating, or structuring it appropriately.

**Analysis and Modeling:** Applying techniques like statistical methods, machine learning, or predictive modeling to extract insights.

**Visualization and Reporting:** Presenting the results through dashboards, charts, graphs, or reports to communicate findings effectively.

**Tools:** Excel, Python (Pandas, NumPy, Matplotlib, Seaborn)

**Applications of Data Analysis:**

**Business:** Market trend analysis, customer segmentation, and performance evaluation.

**Healthcare:** Patient diagnosis, medical research, and drug effectiveness studies.

**Finance:** Fraud detection, risk assessment, and investment strategies.

**Education:** Analyzing student performance and improving learning outcomes.

**Sports:** Player performance evaluation and game strategy optimization.

**Simple Scenario:**

A retail company wants to analyze its sales data to understand trends and improve sales performance.

**1. Data Collection**
    
**Example:** Collect sales data for the past year from the company’s point-of-sale (POS) system.

**Data Includes:**
  1. Date of sale
  2. Product category
  3. Quantity sold
  4. Revenue
  5. Customer demographics (age, location)
     
**Purpose:** Gather raw data that answers questions like "Which products sell the most?" or "What regions are underperforming?"

**2. Data Cleaning**

**Example:** Inspect the dataset for issues.

1. Remove duplicate sales entries.
2. Correct inconsistencies in product names (e.g., "t-shirt" vs. "T-shirt").
3. Handle missing data, such as revenue values for some transactions.

**Why?** Clean data ensures accurate and reliable analysis.
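
The three cleaning steps above could look roughly like this in Pandas. This is a minimal sketch: the `sales` table, its column names, and its values are made up for illustration, and dropping rows with missing revenue is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical POS extract; column names and values are illustrative only.
sales = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "product": ["t-shirt", "t-shirt", "T-shirt", "Laptop", "laptop "],
    "quantity": [2, 2, 1, 1, 3],
    "revenue": [40.0, 40.0, 20.0, None, 2700.0],
})

# 1. Remove exact duplicate rows (order 1 appears twice).
sales = sales.drop_duplicates()

# 2. Normalize product names: trim whitespace and use one casing.
sales["product"] = sales["product"].str.strip().str.lower()

# 3. Handle missing revenue. Here the rows are dropped; filling in a value
#    is another option, depending on the analysis.
clean = sales.dropna(subset=["revenue"]).reset_index(drop=True)

print(clean)
```

Normalizing text before looking for duplicates would also catch rows that differ only in casing; the order used here simply mirrors the list above.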

**3. Exploratory Data Analysis (EDA)**

**Example:** Use descriptive statistics and visualizations to explore the data.

1. Find the total sales revenue.
2. Identify which product categories generate the most revenue.
3. Plot sales trends over time (e.g., sales increase during the holiday season).

**Tool:** Use Python (Matplotlib, Pandas) or Excel to create charts and summaries.

**Outcome:**

1. "Electronics" is the top-selling category.
2. Sales peak in December and dip in February.
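
As a rough illustration of these EDA steps, the sketch below computes total revenue, revenue per category, and revenue per month. The `sales` table and its numbers are invented for the example and are not the retail company's data.

```python
import pandas as pd

# Hypothetical sales records; values are made up for illustration.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-11-05", "2024-12-10", "2024-12-20", "2025-02-03"]),
    "category": ["Fashion", "Electronics", "Electronics", "Fashion"],
    "revenue": [120.0, 900.0, 650.0, 80.0],
})

# Total revenue across all sales.
total_revenue = sales["revenue"].sum()

# Revenue per category, largest first.
revenue_by_category = (
    sales.groupby("category")["revenue"].sum().sort_values(ascending=False)
)

# Revenue per calendar month, ready to be plotted as a trend line.
monthly_revenue = sales.groupby(sales["date"].dt.to_period("M"))["revenue"].sum()

print(total_revenue)
print(revenue_by_category)
print(monthly_revenue)
```

If Matplotlib is installed, `monthly_revenue.plot()` is one way to turn the last series into a trend chart.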


**4. Data Transformation**

**Example:** Prepare the data for deeper analysis.
  
1. Group data by month to analyze monthly trends.
2. Aggregate data by customer age groups to understand customer segmentation.

**Why?** Grouping and aggregating the data makes patterns and relationships easier to identify.
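
Aggregating by age group can be sketched with `pd.cut`, which assigns each age to a labeled bin. The `customers` table, the bin edges, and the labels below are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical customer purchases; values are illustrative only.
customers = pd.DataFrame({
    "age": [19, 24, 31, 45, 52, 67],
    "revenue": [50.0, 70.0, 120.0, 200.0, 90.0, 60.0],
})

# right=False makes each bin include its left edge: [18, 25), [25, 40), ...
bins = [18, 25, 40, 60, 100]
labels = ["18-24", "25-39", "40-59", "60+"]
customers["age_group"] = pd.cut(customers["age"], bins=bins, labels=labels, right=False)

# Sum revenue within each age group.
revenue_by_age = customers.groupby("age_group", observed=True)["revenue"].sum()

print(revenue_by_age)
```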

**5. Analysis and Modeling**

**Example:** Answer key business questions:

1. Use trend analysis to predict next year's sales during peak seasons.
2. Apply clustering to group customers by purchase behavior.
3. Perform a correlation analysis to check whether discounts are associated with higher sales.

**Outcome:**

1. Discounts are most effective for electronics during the holiday season.
2. Younger customers (ages 18–25) prefer fashion-related products.
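
The correlation check mentioned above can be sketched with `Series.corr`, which computes the Pearson coefficient by default. The `promo` table is invented for illustration. A high coefficient shows that the two columns move together; on its own it does not establish that discounts cause the extra sales.

```python
import pandas as pd

# Hypothetical weekly promotions; numbers are made up for illustration.
promo = pd.DataFrame({
    "discount_pct": [0, 5, 10, 15, 20],
    "units_sold": [10, 12, 15, 19, 22],
})

# Pearson correlation between discount level and units sold.
r = promo["discount_pct"].corr(promo["units_sold"])

print(round(r, 3))
```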

**6. Visualization and Reporting**

**Example:** Present findings to the management team.
1. Create a bar chart showing monthly sales revenue.
2. Use a pie chart to represent sales by product category.
3. Build a dashboard in Tableau or Power BI for interactive exploration.

**Insights Shared:**
1. Focus on stocking electronics in December for maximum sales.
2. Offer targeted discounts for fashion products to younger customers.

In [148]:
import pandas as pd


In [149]:
uber_data=pd.read_csv("Uber.csv")

In [115]:
uber_data.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
1,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit


In [23]:
uber_data.tail(10)

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
1146,12/30/2016 11:31,12/30/2016 11:56,Business,Kar?chi,Kar?chi,2.9,Errand/Supplies
1147,12/30/2016 15:41,12/30/2016 16:03,Business,Kar?chi,Kar?chi,4.6,Errand/Supplies
1148,12/30/2016 16:45,12/30/2016 17:08,Business,Kar?chi,Kar?chi,4.6,Meeting
1149,12/30/2016 23:06,12/30/2016 23:10,Business,Kar?chi,Kar?chi,0.8,Customer Visit
1150,12/31/2016 1:07,12/31/2016 1:14,Business,Kar?chi,Kar?chi,0.7,Meeting
1151,12/31/2016 13:24,12/31/2016 13:42,Business,Kar?chi,Unknown Location,3.9,Temporary Site
1152,12/31/2016 15:03,12/31/2016 15:38,Business,Unknown Location,Unknown Location,16.2,Meeting
1153,12/31/2016 21:32,12/31/2016 21:50,Business,Katunayake,Gampaha,6.4,Temporary Site
1154,12/31/2016 22:08,12/31/2016 23:51,Business,Gampaha,Ilukwatta,48.2,Temporary Site
1155,Totals,,,,,12204.7,


In [8]:
print(uber_data.shape)

(1156, 7)


In [9]:
print(uber_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   START_DATE*  1156 non-null   object 
 1   END_DATE*    1155 non-null   object 
 2   CATEGORY*    1155 non-null   object 
 3   START*       1155 non-null   object 
 4   STOP*        1155 non-null   object 
 5   MILES*       1156 non-null   float64
 6   PURPOSE*     653 non-null    object 
dtypes: float64(1), object(6)
memory usage: 63.3+ KB
None


In [10]:
print(uber_data.describe())

             MILES*
count   1156.000000
mean      21.115398
std      359.299007
min        0.500000
25%        2.900000
50%        6.000000
75%       10.400000
max    12204.700000


In [11]:
print(uber_data.isnull().sum())

START_DATE*      0
END_DATE*        1
CATEGORY*        1
START*           1
STOP*            1
MILES*           0
PURPOSE*       503
dtype: int64
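
The last line of the file is a "Totals" summary row, which explains the single missing value in most columns and the maximum of 12204.7 miles reported by `describe()`. A minimal sketch of removing it before computing statistics follows, using a three-line inline sample in the same layout as `Uber.csv`. The names `sample` and `trips` are placeholders for this example.

```python
import io
import pandas as pd

# Small sample in the same layout as Uber.csv, including the summary row.
sample_csv = io.StringIO(
    "START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*\n"
    "1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain\n"
    "1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,\n"
    "Totals,,,,,12204.7,\n"
)
sample = pd.read_csv(sample_csv)

# Keep only real trips: the summary row has "Totals" in START_DATE*.
trips = sample[sample["START_DATE*"] != "Totals"].copy()

# Optional: strip the trailing "*" so the columns are easier to type.
trips.columns = trips.columns.str.rstrip("*")

print(trips["MILES"].describe())
```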


In [12]:
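# Note: skiprows=1 skips the header line, so the first trip's values become the column labels.
# The dtype key "miles" matches no column in the file, so it has no effect here.
df=pd.read_csv("Uber.csv",sep=",",dtype={"miles":int},skiprows=1,nrows=5,na_values=["NA","Unknown"])
print(df)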

   1/1/2016 21:11  1/1/2016 21:17  Business      Fort Pierce    Fort Pierce.1  \
0   1/2/2016 1:25   1/2/2016 1:37  Business      Fort Pierce      Fort Pierce   
1  1/2/2016 20:25  1/2/2016 20:38  Business      Fort Pierce      Fort Pierce   
2  1/5/2016 17:31  1/5/2016 17:45  Business      Fort Pierce      Fort Pierce   
3  1/6/2016 14:42  1/6/2016 15:49  Business      Fort Pierce  West Palm Beach   
4  1/6/2016 17:15  1/6/2016 17:19  Business  West Palm Beach  West Palm Beach   

    5.1   Meal/Entertain  
0   5.0              NaN  
1   4.8  Errand/Supplies  
2   4.7          Meeting  
3  63.7   Customer Visit  
4   4.3   Meal/Entertain  


In [16]:
df.iloc[0]

1/1/2016 21:11    1/2/2016 1:25
1/1/2016 21:17    1/2/2016 1:37
Business               Business
Fort Pierce         Fort Pierce
Fort Pierce.1       Fort Pierce
5.1                         5.0
Meal/Entertain              NaN
Name: 0, dtype: object
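
The labels in the output above are the first trip's values because `skiprows=1` removed the header line. The sketch below contrasts that with `skiprows=[1]`, which drops only line index 1 and keeps the header. It uses a short inline CSV rather than `Uber.csv` so that it runs on its own.

```python
import io
import pandas as pd

csv_text = (
    "START*,STOP*,MILES*\n"
    "Fort Pierce,Fort Pierce,5.1\n"
    "Fort Pierce,Fort Pierce,5.0\n"
    "Fort Pierce,West Palm Beach,63.7\n"
)

# skiprows=1 drops the first line of the file, which is the header,
# so the first data row is promoted to column labels.
lost_header = pd.read_csv(io.StringIO(csv_text), skiprows=1)

# skiprows=[1] drops only line index 1 (the first data row) and keeps the header.
kept_header = pd.read_csv(io.StringIO(csv_text), skiprows=[1])

print(list(lost_header.columns))
print(list(kept_header.columns))
```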

In [17]:
df.iloc[2:8,0:3]

Unnamed: 0,1/1/2016 21:11,1/1/2016 21:17,Business
2,1/5/2016 17:31,1/5/2016 17:45,Business
3,1/6/2016 14:42,1/6/2016 15:49,Business
4,1/6/2016 17:15,1/6/2016 17:19,Business


In [20]:
df.iloc[:,:-1]

Unnamed: 0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce.1,5.1
0,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0
1,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8
2,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7
3,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7
4,1/6/2016 17:15,1/6/2016 17:19,Business,West Palm Beach,West Palm Beach,4.3


In [19]:
df.iloc[:,-1:]

Unnamed: 0,Meal/Entertain
0,
1,Errand/Supplies
2,Meeting
3,Customer Visit
4,Meal/Entertain


In [27]:
temp=pd.DataFrame({'A':[1,2,3],'B':[10,20,30],'C':['2025-1-19','2025-7-30','2025-9-17']})
temp

Unnamed: 0,A,B,C
0,1,10,2025-1-19
1,2,20,2025-7-30
2,3,30,2025-9-17


In [28]:
temp.dtypes

A     int64
B     int64
C    object
dtype: object

In [29]:
temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       3 non-null      int64 
 1   B       3 non-null      int64 
 2   C       3 non-null      object
dtypes: int64(2), object(1)
memory usage: 204.0+ bytes


In [30]:
temp['C']=pd.to_datetime(temp['C'])
temp.dtypes

A             int64
B             int64
C    datetime64[ns]
dtype: object

In [35]:
temp1=pd.DataFrame({'name':['ravi','arjun','bhavika'],'age':[30,28,12],'date':['19-5-2025','30-8-2025','17-4-2025']})
temp1

Unnamed: 0,name,age,date
0,ravi,30,19-5-2025
1,arjun,28,30-8-2025
2,bhavika,12,17-4-2025


In [36]:
temp1.dtypes

name    object
age      int64
date    object
dtype: object

In [41]:
# The strings are day-month-year with hyphens and a four-digit year, e.g. '19-5-2025'.
temp1['date']=pd.to_datetime(temp1['date'],format="%d-%m-%Y")
temp1.dtypes

name            object
age              int64
date    datetime64[ns]
dtype: object

In [45]:
temp1['name'] = temp1['name'].astype(str)
temp1.dtypes

name            object
age              int64
date    datetime64[ns]
dtype: object

In [46]:
temp['B'] = temp['B'].astype(float)
temp.dtypes

A             int64
B           float64
C    datetime64[ns]
dtype: object
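
When converting text to dates, an explicit `format` removes ambiguity, and `errors="coerce"` turns values that do not match into `NaT` instead of raising an exception. A small sketch with day-first strings like those in `temp1['date']` follows. The series `raw` and its invalid entry are made up for the example.

```python
import pandas as pd

raw = pd.Series(["19-5-2025", "30-8-2025", "not a date"])

# %d-%m-%Y: day, month, four-digit year, separated by hyphens.
# Values that do not match the format become NaT because of errors="coerce".
parsed = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

print(parsed)
```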

In [53]:
uber_data['START*'].value_counts()

START*
Cary                201
Unknown Location    148
Morrisville          85
Whitebridge          68
Islamabad            57
                   ... 
Florence              1
Ridgeland             1
Daytona Beach         1
Sky Lake              1
Gampaha               1
Name: count, Length: 177, dtype: int64

In [51]:
uber_data['START*'].value_counts().head(10)

START*
Cary                201
Unknown Location    148
Morrisville          85
Whitebridge          68
Islamabad            57
Durham               37
Lahore               36
Raleigh              28
Kar?chi              27
Westpark Place       17
Name: count, dtype: int64

In [64]:
a=uber_data[uber_data['MILES*']>50]
a

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit
232,3/17/2016 12:52,3/17/2016 15:11,Business,Austin,Katy,136.0,Customer Visit
251,3/19/2016 19:33,3/19/2016 20:39,Business,Galveston,Houston,57.0,Customer Visit
268,3/25/2016 13:24,3/25/2016 16:22,Business,Cary,Latta,144.0,Customer Visit
269,3/25/2016 16:52,3/25/2016 22:22,Business,Latta,Jacksonville,310.3,Customer Visit
270,3/25/2016 22:54,3/26/2016 1:39,Business,Jacksonville,Kissimmee,201.0,Meeting
295,4/2/2016 12:21,4/2/2016 14:47,Business,Kissimmee,Daytona Beach,77.3,Customer Visit
296,4/2/2016 16:57,4/2/2016 18:09,Business,Daytona Beach,Jacksonville,80.5,Customer Visit
297,4/2/2016 19:38,4/2/2016 22:36,Business,Jacksonville,Ridgeland,174.2,Customer Visit
298,4/2/2016 23:11,4/3/2016 1:34,Business,Ridgeland,Florence,144.0,Meeting


In [116]:
b = uber_data[(uber_data['MILES*'] > 50) & (uber_data['MILES*'] < 100)]
b

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit
251,3/19/2016 19:33,3/19/2016 20:39,Business,Galveston,Houston,57.0,Customer Visit
295,4/2/2016 12:21,4/2/2016 14:47,Business,Kissimmee,Daytona Beach,77.3,Customer Visit
296,4/2/2016 16:57,4/2/2016 18:09,Business,Daytona Beach,Jacksonville,80.5,Customer Visit
707,8/24/2016 13:01,8/24/2016 15:25,Business,Unknown Location,Unknown Location,96.2,
710,8/25/2016 17:19,8/25/2016 19:20,Business,Unknown Location,Unknown Location,50.4,
726,8/27/2016 14:01,8/27/2016 15:44,Business,Lahore,Unknown Location,86.6,
751,9/6/2016 17:49,9/6/2016 17:49,Business,Unknown Location,Unknown Location,69.1,
871,10/28/2016 20:13,10/28/2016 22:00,Business,Asheville,Topton,91.8,Meeting
873,10/29/2016 17:13,10/29/2016 19:19,Business,Hayesville,Topton,75.7,


In [120]:
p=uber_data.loc[uber_data['START*']=='Cary']
p

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
7,1/7/2016 13:27,1/7/2016 13:33,Business,Cary,Cary,0.8,Meeting
8,1/10/2016 8:05,1/10/2016 8:25,Business,Cary,Morrisville,8.3,Meeting
28,1/15/2016 11:43,1/15/2016 12:03,Business,Cary,Durham,10.4,Meal/Entertain
30,1/18/2016 14:55,1/18/2016 15:06,Business,Cary,Cary,4.8,Meal/Entertain
34,1/20/2016 10:36,1/20/2016 11:11,Business,Cary,Raleigh,17.1,Meeting
...,...,...,...,...,...,...,...
1049,12/13/2016 20:20,12/13/2016 20:29,Business,Cary,Cary,4.1,Meal/Entertain
1050,12/14/2016 16:52,12/14/2016 17:10,Business,Cary,Cary,3.4,
1051,12/14/2016 17:22,12/14/2016 17:34,Business,Cary,Cary,3.3,
1052,12/14/2016 17:50,12/14/2016 18:00,Business,Cary,Morrisville,3.0,Meal/Entertain


In [136]:
start=['Cary','Austin','Boone']
stop=['Topton','Cary','Florence']

In [138]:
# Named route_trips rather than display to avoid shadowing IPython's display() function.
route_trips=uber_data.loc[(uber_data['START*'].isin(start)) & (uber_data['STOP*'].isin(stop))& (uber_data['MILES*']>10) & (uber_data['MILES*']<100)]
route_trips

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
979,NaT,11/20/2016 11:32,Business,Cary,Cary,39.2,Between Offices
982,NaT,11/20/2016 18:37,Business,Cary,Cary,18.5,Errand/Supplies
990,NaT,11/22/2016 16:43,Business,Cary,Cary,12.7,Customer Visit
1035,2016-09-12 22:03:00,12/9/2016 22:57,Business,Cary,Cary,18.9,Customer Visit


In [121]:
n=uber_data.loc[uber_data['START*'].isin(['Cary','Florence'])]
n

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
7,1/7/2016 13:27,1/7/2016 13:33,Business,Cary,Cary,0.8,Meeting
8,1/10/2016 8:05,1/10/2016 8:25,Business,Cary,Morrisville,8.3,Meeting
28,1/15/2016 11:43,1/15/2016 12:03,Business,Cary,Durham,10.4,Meal/Entertain
30,1/18/2016 14:55,1/18/2016 15:06,Business,Cary,Cary,4.8,Meal/Entertain
34,1/20/2016 10:36,1/20/2016 11:11,Business,Cary,Raleigh,17.1,Meeting
...,...,...,...,...,...,...,...
1049,12/13/2016 20:20,12/13/2016 20:29,Business,Cary,Cary,4.1,Meal/Entertain
1050,12/14/2016 16:52,12/14/2016 17:10,Business,Cary,Cary,3.4,
1051,12/14/2016 17:22,12/14/2016 17:34,Business,Cary,Cary,3.3,
1052,12/14/2016 17:50,12/14/2016 18:00,Business,Cary,Morrisville,3.0,Meal/Entertain


In [122]:
new = uber_data.loc[(uber_data['START*'].isin(['Fort Pierce', 'Cary', 'Latta'])) & (uber_data['STOP*'].isin(['Latta', 'Katy', 'Jacksonville']))]
new


Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
268,3/25/2016 13:24,3/25/2016 16:22,Business,Cary,Latta,144.0,Customer Visit
269,3/25/2016 16:52,3/25/2016 22:22,Business,Latta,Jacksonville,310.3,Customer Visit
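
Long filter expressions are often easier to read when each condition is stored as a named boolean mask and the masks are combined with `&`. The sketch below uses a four-row sample with values taken from the trips shown earlier. The mask names and the 100 to 200 mile range are assumptions for illustration.

```python
import pandas as pd

trips = pd.DataFrame({
    "START*": ["Cary", "Latta", "Austin", "Cary"],
    "STOP*": ["Latta", "Jacksonville", "Katy", "Cary"],
    "MILES*": [144.0, 310.3, 136.0, 0.8],
})

starts = ["Fort Pierce", "Cary", "Latta"]
stops = ["Latta", "Katy", "Jacksonville"]

# One boolean Series per condition.
from_start = trips["START*"].isin(starts)
to_stop = trips["STOP*"].isin(stops)
mid_range = trips["MILES*"].between(100, 200)  # inclusive on both ends by default

# Combine masks with & (and); each mask already has one value per row.
selected = trips[from_start & to_stop]
selected_mid = trips[from_start & to_stop & mid_range]

print(selected)
print(selected_mid)
```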


In [134]:
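# Caution: the file stores dates as month/day/year. dayfirst=True swaps day and month
# whenever both readings are valid, so "2/1/2016" (1 February) is parsed as 2 January
# and the month filter below also returns trips from other months.
uber_data['START_DATE*'] = pd.to_datetime(uber_data['START_DATE*'], dayfirst=True)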
r = uber_data[uber_data['START_DATE*'].dt.month == 1]
r

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,2016-01-01 21:11:00,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
61,2016-01-02 10:35:00,2/1/2016 11:15,Business,Cary,Chapel Hill,19.4,Customer Visit
62,2016-01-02 12:10:00,2/1/2016 12:43,Business,Chapel Hill,Cary,23.3,Customer Visit
63,2016-01-02 12:56:00,2/1/2016 13:07,Business,Northwoods,Whitebridge,3.9,Meal/Entertain
176,2016-01-03 18:47:00,3/1/2016 19:10,Business,Whitebridge,Wayne Ridge,8.0,Meal/Entertain
177,2016-01-03 21:27:00,3/1/2016 21:45,Business,Wayne Ridge,Whitebridge,8.0,Meeting
289,2016-01-04 13:43:00,4/1/2016 14:01,Business,Kissimmee,Kissimmee,11.0,Meeting
290,2016-01-04 14:36:00,4/1/2016 15:24,Business,Kissimmee,Orlando,15.5,Customer Visit
291,2016-01-04 16:01:00,4/1/2016 16:49,Business,Orlando,Kissimmee,20.3,Meeting
292,2016-01-04 16:52:00,4/1/2016 16:57,Personal,Kissimmee,Kissimmee,0.7,
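
Several rows above have an `END_DATE*` in February, March, or April even though they passed the January filter, because `dayfirst=True` reads the month/day/year values in this file with day and month swapped. The sketch below shows the difference on two sample strings and parses them with an explicit month-first format instead. The variable names are placeholders.

```python
import pandas as pd

raw = pd.Series(["2/1/2016 10:35", "12/30/2016 11:31"])

# Spell out the month/day/year layout so nothing is guessed.
month_first = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")

# With dayfirst=True the ambiguous first value is read as 2 January.
day_first = pd.to_datetime(raw.iloc[:1], dayfirst=True)

print(month_first.iloc[0])
print(day_first.iloc[0])
```

With the explicit format, `month_first.dt.month == 1` selects only trips that actually started in January.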


In [143]:
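# errors='coerce' turns values that cannot be parsed (such as the "Totals" row) into NaT.
# Caution: the dates are month/day/year, so dayfirst=True misreads ambiguous values.
uber_data['START_DATE*'] = pd.to_datetime(uber_data['START_DATE*'], errors='coerce', dayfirst=True)
uber_data['END_DATE*'] = pd.to_datetime(uber_data['END_DATE*'], errors='coerce', dayfirst=True)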

In [144]:
uber_data['START*'] = uber_data['START*'].astype(str)

In [146]:
jan_2016_cary = uber_data[(uber_data['START_DATE*'].dt.month == 1) & (uber_data['START_DATE*'].dt.year == 2016) & (uber_data['START*'] == 'Cary')]
jan_2016_cary

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
61,2016-01-02 10:35:00,2016-01-02 11:15:00,Business,Cary,Chapel Hill,19.4,Customer Visit
392,2016-01-06 10:19:00,2016-01-06 10:47:00,Business,Cary,Morrisville,6.7,Customer Visit
501,2016-01-07 09:34:00,2016-01-07 09:57:00,Business,Cary,Raleigh,13.3,Meeting
503,2016-01-07 20:06:00,2016-01-07 20:24:00,Business,Cary,Durham,10.5,Meeting
615,2016-01-08 13:52:00,2016-01-08 14:14:00,Business,Cary,Apex,6.9,
618,2016-01-08 16:29:00,2016-01-08 16:59:00,Business,Cary,Morrisville,9.1,
887,2016-01-11 11:50:00,2016-01-11 12:27:00,Business,Cary,Durham,16.5,
1009,2016-01-12 07:44:00,2016-01-12 07:59:00,Business,Cary,Cary,5.5,Meeting
1010,2016-01-12 08:37:00,2016-01-12 08:53:00,Business,Cary,Cary,5.5,Errand/Supplies
1011,2016-01-12 18:00:00,2016-01-12 18:12:00,Business,Cary,Morrisville,2.9,Meal/Entertain


In [172]:
r.reset_index(inplace=True,drop=True)

In [173]:
r

Unnamed: 0,index,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,0,2016-01-01 21:11:00,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
1,1,2016-01-02 10:35:00,2/1/2016 11:15,Business,Cary,Chapel Hill,19.4,Customer Visit
2,2,2016-01-02 12:10:00,2/1/2016 12:43,Business,Chapel Hill,Cary,23.3,Customer Visit
3,3,2016-01-02 12:56:00,2/1/2016 13:07,Business,Northwoods,Whitebridge,3.9,Meal/Entertain
4,4,2016-01-03 18:47:00,3/1/2016 19:10,Business,Whitebridge,Wayne Ridge,8.0,Meal/Entertain
5,5,2016-01-03 21:27:00,3/1/2016 21:45,Business,Wayne Ridge,Whitebridge,8.0,Meeting
6,6,2016-01-04 13:43:00,4/1/2016 14:01,Business,Kissimmee,Kissimmee,11.0,Meeting
7,7,2016-01-04 14:36:00,4/1/2016 15:24,Business,Kissimmee,Orlando,15.5,Customer Visit
8,8,2016-01-04 16:01:00,4/1/2016 16:49,Business,Orlando,Kissimmee,20.3,Meeting
9,9,2016-01-04 16:52:00,4/1/2016 16:57,Personal,Kissimmee,Kissimmee,0.7,


In [164]:
r.iloc[10:35]

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
10,2016-01-05 13:45:00,5/1/2016 13:53,Business,Whitebridge,Westpark Place,2.1,Meal/Entertain
11,2016-01-05 14:26:00,5/1/2016 14:31,Business,Westpark Place,Whitebridge,2.3,
12,2016-01-05 17:33:00,5/1/2016 17:45,Business,Whitebridge,Tanglewood,6.2,Between Offices
13,2016-01-05 17:54:00,5/1/2016 18:10,Business,Tanglewood,Parkway,7.5,Meeting
14,2016-01-05 22:38:00,5/1/2016 22:49,Business,Parkway,Whitebridge,3.1,Errand/Supplies
15,2016-01-06 10:19:00,6/1/2016 10:47,Business,Cary,Morrisville,6.7,Customer Visit
16,2016-01-06 13:10:00,6/1/2016 13:39,Business,Morrisville,Cary,9.6,Meeting
17,2016-01-07 00:00:00,7/1/2016 0:25,Business,Durham,Cary,9.9,Meeting
18,2016-01-07 09:34:00,7/1/2016 9:57,Business,Cary,Raleigh,13.3,Meeting
19,2016-01-07 12:36:00,7/1/2016 13:00,Business,Raleigh,Cary,11.3,Meeting


In [175]:
uber_data.sort_values(by='MILES*')

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
420,6/8/2016 17:16,6/8/2016 17:18,Business,Soho,Tribeca,0.5,Errand/Supplies
44,1/26/2016 17:27,1/26/2016 17:29,Business,Cary,Cary,0.5,Errand/Supplies
120,2/17/2016 16:38,2/17/2016 16:43,Business,Katunayaka,Katunayaka,0.5,Errand/Supplies
1111,12/25/2016 0:10,12/25/2016 0:14,Business,Lahore,Lahore,0.6,Errand/Supplies
1110,12/24/2016 22:04,12/24/2016 22:09,Business,Lahore,Lahore,0.6,Errand/Supplies
...,...,...,...,...,...,...,...
776,9/27/2016 21:01,9/28/2016 2:37,Business,Unknown Location,Unknown Location,195.6,
881,10/30/2016 15:22,10/30/2016 18:23,Business,Asheville,Mebane,195.9,
270,3/25/2016 22:54,3/26/2016 1:39,Business,Jacksonville,Kissimmee,201.0,Meeting
269,3/25/2016 16:52,3/25/2016 22:22,Business,Latta,Jacksonville,310.3,Customer Visit


In [177]:
uber_data.sort_values(by='START*',ascending=False)

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
870,10/28/2016 18:13,10/28/2016 20:07,Business,Winston Salem,Asheville,133.6,Meeting
648,8/11/2016 12:53,8/11/2016 13:00,Business,Whitebridge,Heritage Pines,2.2,
72,2/4/2016 18:04,2/4/2016 18:31,Business,Whitebridge,Macgregor Downs,9.0,Meeting
889,11/1/2016 17:35,11/1/2016 17:42,Business,Whitebridge,Whitebridge,1.2,
459,6/24/2016 10:41,6/24/2016 10:57,Business,Whitebridge,Waverly Place,7.1,Meal/Entertain
...,...,...,...,...,...,...,...
906,11/4/2016 21:04,11/4/2016 21:20,Business,Agnew,Cory,4.3,
911,11/6/2016 10:50,11/6/2016 11:04,Business,Agnew,Renaissance,2.4,
910,11/5/2016 19:20,11/5/2016 19:28,Business,Agnew,Agnew,2.2,
908,11/5/2016 8:34,11/5/2016 8:43,Business,Agnew,Renaissance,2.2,


In [181]:
uber_data.sort_values(by=['START*','MILES*'],ascending=[True,False])

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
906,11/4/2016 21:04,11/4/2016 21:20,Business,Agnew,Cory,4.3,
911,11/6/2016 10:50,11/6/2016 11:04,Business,Agnew,Renaissance,2.4,
908,11/5/2016 8:34,11/5/2016 8:43,Business,Agnew,Renaissance,2.2,
910,11/5/2016 19:20,11/5/2016 19:28,Business,Agnew,Agnew,2.2,
879,10/30/2016 12:58,10/30/2016 13:18,Business,Almond,Bryson City,15.2,
...,...,...,...,...,...,...,...
889,11/1/2016 17:35,11/1/2016 17:42,Business,Whitebridge,Whitebridge,1.2,
890,11/1/2016 19:14,11/1/2016 19:20,Business,Whitebridge,Whitebridge,1.0,
516,7/5/2016 16:48,7/5/2016 16:52,Business,Whitebridge,Whitebridge,0.6,Errand/Supplies
870,10/28/2016 18:13,10/28/2016 20:07,Business,Winston Salem,Asheville,133.6,Meeting


In [186]:
import numpy as np
uber_data['MILES_CAT']=np.where(uber_data['MILES*']>100,"Long trip","short trip")
uber_data['nc']=10
uber_data.head(10)

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,MILES_CAT,nc
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,short trip,10
1,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,,short trip,10
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,short trip,10
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting,short trip,10
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,short trip,10
5,1/6/2016 17:15,1/6/2016 17:19,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain,short trip,10
6,1/6/2016 17:30,1/6/2016 17:35,Business,West Palm Beach,Palm Beach,7.1,Meeting,short trip,10
7,1/7/2016 13:27,1/7/2016 13:33,Business,Cary,Cary,0.8,Meeting,short trip,10
8,1/10/2016 8:05,1/10/2016 8:25,Business,Cary,Morrisville,8.3,Meeting,short trip,10
9,1/10/2016 12:17,1/10/2016 12:44,Business,Jamaica,New York,16.5,Customer Visit,short trip,10


In [198]:
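# Three bands: under 100 miles, 100 up to (but not including) 200 miles, and 200 miles or more.
uber_data['MILES_C'] = np.where(uber_data['MILES*'] < 100, "short trip", np.where((uber_data['MILES*'] >= 100) & (uber_data['MILES*'] < 200), "medium trip", "long trip"))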

In [199]:
uber_data.head(10)

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,MILES_CAT,nc,MILES_C
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,short trip,10,short trip
1,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,,short trip,10,short trip
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,short trip,10,short trip
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting,short trip,10,short trip
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,short trip,10,short trip
5,1/6/2016 17:15,1/6/2016 17:19,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain,short trip,10,short trip
6,1/6/2016 17:30,1/6/2016 17:35,Business,West Palm Beach,Palm Beach,7.1,Meeting,short trip,10,short trip
7,1/7/2016 13:27,1/7/2016 13:33,Business,Cary,Cary,0.8,Meeting,short trip,10,short trip
8,1/10/2016 8:05,1/10/2016 8:25,Business,Cary,Morrisville,8.3,Meeting,short trip,10,short trip
9,1/10/2016 12:17,1/10/2016 12:44,Business,Jamaica,New York,16.5,Customer Visit,short trip,10,short trip


In [203]:
long_trip_data = uber_data[uber_data['MILES_C'] == 'long trip']
long_trip_data

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,MILES_CAT,nc,MILES_C
269,3/25/2016 16:52,3/25/2016 22:22,Business,Latta,Jacksonville,310.3,Customer Visit,Long trip,10,long trip
270,3/25/2016 22:54,3/26/2016 1:39,Business,Jacksonville,Kissimmee,201.0,Meeting,Long trip,10,long trip
1155,Totals,,,,,12204.7,,Long trip,10,long trip


In [204]:
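# Store the count under its own name instead of overwriting the filtered DataFrame.
# The count includes the "Totals" summary row (index 1155), which is not a real trip.
long_trip_count = uber_data[uber_data['MILES_C'] == 'long trip'].shape[0]
long_trip_count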

3
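
Nested `np.where` calls become hard to read beyond two categories. One alternative is `np.select`, which takes a list of conditions and a matching list of labels and uses the first condition that is true for each row. This is a sketch on a five-value sample; the label "medium trip" is an illustrative name.

```python
import numpy as np
import pandas as pd

miles = pd.Series([5.1, 63.7, 136.0, 201.0, 310.3], name="MILES*")

# Conditions are checked in order; the first match wins,
# so the second condition only applies to values of 100 or more.
conditions = [miles < 100, miles < 200]
labels = ["short trip", "medium trip"]
category = np.select(conditions, labels, default="long trip")

print(pd.Series(category).value_counts())
```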

In [208]:
uber_data.groupby('START*')['MILES*'].agg('mean')

START*
Agnew                2.775000
Almond              15.200000
Apex                 5.341176
Arabi               17.000000
Arlington            4.900000
                      ...    
West University      2.200000
Weston               4.000000
Westpark Place       2.182353
Whitebridge          4.020588
Winston Salem      133.600000
Name: MILES*, Length: 177, dtype: float64
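
A single mean per start location can hide how many trips it is based on. Winston Salem's 133.6, for example, matches the one trip from there that appears in the sorted output earlier. The sketch below uses named aggregation to compute the trip count, mean, and maximum in one pass. The five sample rows come from trips shown earlier, and the result column names are placeholders.

```python
import pandas as pd

trips = pd.DataFrame({
    "START*": ["Cary", "Cary", "Agnew", "Agnew", "Winston Salem"],
    "MILES*": [8.3, 10.4, 4.3, 2.4, 133.6],
})

# Each keyword becomes a result column; each value names the aggregation to apply.
summary = (
    trips.groupby("START*")["MILES*"]
    .agg(n_trips="count", mean_miles="mean", max_miles="max")
    .sort_values("mean_miles", ascending=False)
)

print(summary)
```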