**What is Data Analysis?**

Data Analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and 
support decision-making. It is a crucial step in any data-driven approach, helping organizations and individuals make informed decisions by interpreting 
data patterns, trends, and insights.

**Steps in Data Analysis:**

**Data Collection:** Gathering raw data from various sources such as databases, APIs, surveys, or logs.

**Data Cleaning:** Removing or correcting inaccuracies, duplicates, and inconsistencies in the data.

**Exploratory Data Analysis (EDA):** Summarizing the main characteristics of the data using statistical methods and visualization tools.

**Data Transformation:** Preparing the data for analysis by normalizing, aggregating, or structuring it appropriately.

**Analysis and Modeling:** Applying techniques like statistical methods, machine learning, or predictive modeling to extract insights.

**Visualization and Reporting:** Presenting the results through dashboards, charts, graphs, or reports to communicate findings effectively.

**Tools:** Excel, Python (Pandas, NumPy, Matplotlib, Seaborn)

**Applications of Data Analysis:**

**Business:** Market trend analysis, customer segmentation, and performance evaluation.

**Healthcare:** Patient diagnosis, medical research, and drug effectiveness studies.

**Finance:** Fraud detection, risk assessment, and investment strategies.

**Education:** Analyzing student performance and improving learning outcomes.

**Sports:** Player performance evaluation and game strategy optimization.

**Simple Scenario:**

A retail company wants to analyze its sales data to understand trends and improve sales performance.

**1. Data Collection**
    
**Example:** Collect sales data for the past year from the company’s point-of-sale (POS) system.

**Data Includes:**
  1. Date of sale
  2. Product category
  3. Quantity sold
  4. Revenue
  5. Customer demographics (age, location)
     
**Purpose:** Gather raw data that answers questions like "Which products sell the most?" or "What regions are underperforming?"

**2. Data Cleaning**

**Example:** Inspect the dataset for issues.

1. Remove duplicate sales entries.
2. Correct inconsistencies in product names (e.g., "t-shirt" vs. "T-shirt").
3. Handle missing data, such as revenue values for some transactions.

**Why?** Clean data ensures accurate and reliable analysis.
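
The three cleaning steps above could be sketched in Pandas roughly as follows. The `sales` DataFrame, its column names, and its values are hypothetical, and flagging missing revenue (rather than filling it in) is only one possible policy.

```python
import pandas as pd

# Hypothetical sales records; column names and values are illustrative only.
sales = pd.DataFrame({
    "product": ["t-shirt", "T-shirt", "T-shirt", "Laptop"],
    "quantity": [2, 1, 1, 1],
    "revenue": [30.0, 15.0, 15.0, None],
})

# Normalise product names so "t-shirt" and "T-shirt" count as one product.
sales["product"] = sales["product"].str.strip().str.lower()

# Remove exact duplicate rows (the first occurrence is kept).
sales = sales.drop_duplicates().reset_index(drop=True)

# Flag missing revenue instead of guessing a value (an assumed policy).
sales["revenue_missing"] = sales["revenue"].isna()

print(sales)
```

Normalising names before removing duplicates matters here: rows that differ only in capitalisation would otherwise not be treated as duplicates.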

**3. Exploratory Data Analysis (EDA)**

**Example:** Use descriptive statistics and visualizations to explore the data.
1. Find the total sales revenue.
2. Identify which product categories generate the most revenue.
3. Plot sales trends over time (e.g., sales increase during the holiday season).

**Tool:** Use Python (Matplotlib, Pandas) or Excel to create charts and summaries.

**Outcome:**

"Electronics" is the top-selling category.

Sales peak in December and dip in February.
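
A minimal sketch of these EDA summaries in Pandas, using a made-up `sales` table (the column names and figures are assumptions chosen so that the result mirrors the outcome described above):

```python
import pandas as pd

# Hypothetical sales data; every value here is invented for illustration.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-02-10", "2024-12-05", "2024-12-20", "2024-02-15"]),
    "category": ["Fashion", "Electronics", "Electronics", "Electronics"],
    "revenue": [40.0, 500.0, 300.0, 100.0],
})

# Total sales revenue.
total_revenue = sales["revenue"].sum()

# Revenue per product category, largest first.
revenue_by_category = (
    sales.groupby("category")["revenue"].sum().sort_values(ascending=False)
)

# Revenue per calendar month, to look for seasonal peaks and dips.
revenue_by_month = sales.groupby(sales["date"].dt.month)["revenue"].sum()

print(total_revenue)
print(revenue_by_category)
print(revenue_by_month)
```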


**4. Data Transformation**

**Example:** Prepare the data for deeper analysis.
  
1. Group data by month to analyze monthly trends.
2. Aggregate data by customer age groups to understand customer segmentation.

**Why?** Grouping and aggregating the data makes patterns and relationships easier to identify.
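
Aggregating by age group can be sketched with `pd.cut`, which assigns each customer to a labelled bin. The data, the bin edges, and the labels below are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical customers; ages and revenue are invented.
sales = pd.DataFrame({
    "customer_age": [19, 24, 31, 47, 62],
    "revenue": [20.0, 35.0, 50.0, 80.0, 15.0],
})

# Bin edges are right-inclusive by default: (17, 25], (25, 40], (40, 60], (60, 120].
bins = [17, 25, 40, 60, 120]
labels = ["18-25", "26-40", "41-60", "61+"]
sales["age_group"] = pd.cut(sales["customer_age"], bins=bins, labels=labels)

# Total revenue per age group (observed=True skips empty categories).
revenue_by_age = sales.groupby("age_group", observed=True)["revenue"].sum()
print(revenue_by_age)
```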

**5. Analysis and Modeling**

**Example:** Answer key business questions:

1. Use trend analysis to predict next year's sales during peak seasons.
2. Apply clustering to group customers by purchase behavior.
3. Perform a correlation analysis to check whether discounts are associated with higher sales.

**Outcome:**

1. Discounts are most effective for electronics during the holiday season.
2. Younger customers (ages 18–25) prefer fashion-related products.
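
As a rough illustration of the correlation step, the sketch below computes a Pearson correlation between discount level and units sold. The `weekly` table is invented, and a high correlation only indicates association; it does not by itself show that discounts cause the extra sales.

```python
import pandas as pd

# Hypothetical weekly figures; both columns are made up.
weekly = pd.DataFrame({
    "discount_pct": [0, 5, 10, 15, 20],
    "units_sold": [100, 110, 125, 135, 150],
})

# Series.corr uses the Pearson correlation coefficient by default.
corr = weekly["discount_pct"].corr(weekly["units_sold"])
print(round(corr, 3))
```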

**6. Visualization and Reporting**

**Example:** Present findings to the management team.
1. Create a bar chart showing monthly sales revenue.
2. Use a pie chart to represent sales by product category.
3. Build a dashboard in Tableau or Power BI for interactive exploration.

**Insights Shared:**
1. Focus on stocking electronics in December for maximum sales.
2. Offer targeted discounts for fashion products to younger customers.

In [5]:
import pandas as pd
uber_data=pd.read_csv("Uber.csv")
uber_data.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
1,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit


In [3]:
print(uber_data.shape)
print(uber_data.info())
print(uber_data.describe())

(1156, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   START_DATE*  1156 non-null   object 
 1   END_DATE*    1155 non-null   object 
 2   CATEGORY*    1155 non-null   object 
 3   START*       1155 non-null   object 
 4   STOP*        1155 non-null   object 
 5   MILES*       1156 non-null   float64
 6   PURPOSE*     653 non-null    object 
dtypes: float64(1), object(6)
memory usage: 63.3+ KB
None
             MILES*
count   1156.000000
mean      21.115398
std      359.299007
min        0.500000
25%        2.900000
50%        6.000000
75%       10.400000
max    12204.700000


In [4]:
print(uber_data.isnull().sum())

START_DATE*      0
END_DATE*        1
CATEGORY*        1
START*           1
STOP*            1
MILES*           0
PURPOSE*       503
dtype: int64
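
One way to act on these counts is sketched below on a small stand-in frame that reuses the dataset's column names. The rows are illustrative, and both choices (dropping rows without an end date, labelling a missing purpose as "Unknown") are assumptions rather than the only reasonable policy.

```python
import numpy as np
import pandas as pd

# Small stand-in for the trips table; the last row mimics a summary row
# that has no end date.
trips = pd.DataFrame({
    "START_DATE*": ["1/1/2016 21:11", "1/2/2016 1:25", "Totals"],
    "END_DATE*": ["1/1/2016 21:17", "1/2/2016 1:37", np.nan],
    "MILES*": [5.1, 5.0, 10.1],
    "PURPOSE*": ["Meal/Entertain", np.nan, np.nan],
})

# Drop rows that lack an end date; they cannot describe a complete trip.
trips_clean = trips.dropna(subset=["END_DATE*"]).copy()

# Keep the remaining rows, but label a missing purpose explicitly.
trips_clean["PURPOSE*"] = trips_clean["PURPOSE*"].fillna("Unknown")

print(trips_clean.isnull().sum())
```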


In [14]:
import pandas as pd
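df = pd.read_csv(
    "uber.csv",
    sep=",",                      # column separator
    dtype={"MILES": int},         # the file's column is "MILES*", so this key matches no column
    skiprows=1,                   # skips the header line, so the first data row becomes the header
    nrows=5,                      # read only five rows
    na_values=["NA", "Unknown"]   # extra strings to treat as missing
)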
df

Unnamed: 0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce.1,5.1,Meal/Entertain
0,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,
1,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
2,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting
3,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit
4,1/6/2016 17:15,1/6/2016 17:19,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain
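
The output above has the first trip in place of the column names because `skiprows=1` skips the header line. The effect can be reproduced on a tiny in-memory CSV; the text in `csv_text` is a made-up sample, not the real file.

```python
import io
import pandas as pd

# A tiny made-up CSV with a header line and three data rows.
csv_text = "START*,MILES*\nFort Pierce,5.1\nCary,NA\nUnknown,4.8\n"

# Default behaviour: the first line is the header; "NA" and "Unknown"
# are read as missing values.
with_header = pd.read_csv(io.StringIO(csv_text), na_values=["NA", "Unknown"])

# skiprows=1 drops the header line, so the first data row is used as the header.
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=1, nrows=2)

print(list(with_header.columns))
print(list(skipped.columns))
```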


In [15]:
uber_data.head()
uber_data.tail()
uber_data.iloc[0]
uber_data.iloc[2:8]
uber_data.iloc[2:8,1:3]
uber_data.iloc[20:41,0:4]

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*
20,1/12/2016 15:13,1/12/2016 15:28,Business,Hudson Square
21,1/12/2016 15:42,1/12/2016 15:54,Business,Hell's Kitchen
22,1/12/2016 16:02,1/12/2016 17:00,Business,New York
23,1/13/2016 13:54,1/13/2016 14:07,Business,Downtown
24,1/13/2016 15:00,1/13/2016 15:28,Business,Gulfton
25,1/14/2016 16:29,1/14/2016 17:05,Business,Houston
26,1/14/2016 21:39,1/14/2016 21:45,Business,Eagan Park
27,1/15/2016 0:41,1/15/2016 1:01,Business,Morrisville
28,1/15/2016 11:43,1/15/2016 12:03,Business,Cary
29,1/15/2016 13:26,1/15/2016 13:44,Business,Durham
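
`iloc` selects by integer position and excludes the end of a slice, while `loc` selects by label and includes it. A small sketch on a hypothetical frame with non-default index labels makes the difference visible:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
df = pd.DataFrame(
    {"START*": ["Fort Pierce", "Cary", "New York", "Durham"],
     "MILES*": [5.1, 8.3, 10.8, 10.4]},
    index=[10, 20, 30, 40],
)

by_position = df.iloc[1:3]        # rows at positions 1 and 2 (end excluded)
by_label = df.loc[20:30]          # rows labelled 20 through 30 (end included)
first_column = df.iloc[0:2, 0:1]  # first two rows, first column only

print(by_position)
print(by_label)
```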


In [1]:
import pandas as pd
temp=pd.DataFrame({'A':[1,2,3,4],'B':[10,20,30,40],'C':['2025-1-25','2024-6-13','2023-4-25','2025-3-19']})

In [3]:
temp.dtypes

A     int64
B     int64
C    object
dtype: object

In [4]:
temp['C']=pd.to_datetime(temp['C'])
temp.dtypes

A             int64
B             int64
C    datetime64[ns]
dtype: object

In [5]:
import pandas as pd
templ=pd.DataFrame({'A':[1,2,3,4],'B':[10,20,30,40],'C':['25-1-2025','13-6-2024','25-4-2023','19-3-2025']})
templ.dtypes

A     int64
B     int64
C    object
dtype: object

In [7]:
# Parse day-month-year strings; %Y is the four-digit year
templ['C']=pd.to_datetime(templ['C'],format="%d-%m-%Y")
templ.dtypes

A             int64
B             int64
C    datetime64[ns]
dtype: object

In [8]:
templ


Unnamed: 0,A,B,C
0,1,10,2025-01-25
1,2,20,2024-06-13
2,3,30,2023-04-25
3,4,40,2025-03-19
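
When a column may contain values that do not match the expected format, `errors="coerce"` turns them into `NaT` instead of raising an error. A minimal sketch with an invented Series:

```python
import pandas as pd

# Invented day-month-year strings, plus one value that is not a date.
raw = pd.Series(["25-1-2025", "13-6-2024", "not a date"])

# %d-%m-%Y: day, month, four-digit year; unparseable values become NaT.
parsed = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

print(parsed)
```

Counting the `NaT` values afterwards (for example with `parsed.isna().sum()`) is a quick check on how many entries failed to parse.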


In [10]:
# C is already datetime64 at this point, so the format argument has no effect here
templ['C']=pd.to_datetime(templ['C'],format="%d/%m/%y")
templ.dtypes
templ

Unnamed: 0,A,B,C
0,1,10,2025-01-25
1,2,20,2024-06-13
2,3,30,2023-04-25
3,4,40,2025-03-19


In [11]:
templ['A'] = templ['A'].astype(str)
templ.dtypes

A            object
B             int64
C    datetime64[ns]
dtype: object

In [12]:
templ['B'] = templ['B'].astype(float)
templ.dtypes

A            object
B           float64
C    datetime64[ns]
dtype: object

In [15]:
import pandas as pd
uber_data=pd.read_csv("Uber.csv")
uber_data.head()


Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
1,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit


In [20]:
# Strip any leading/trailing spaces from the column names
uber_data.columns = uber_data.columns.str.strip()
unique_cities = pd.concat([uber_data['START*'], uber_data['STOP*']]).unique()
print(unique_cities)



['Fort Pierce' 'West Palm Beach' 'Cary' 'Jamaica' 'New York' 'Elmhurst'
 'Midtown' 'East Harlem' 'Flatiron District' 'Midtown East'
 'Hudson Square' 'Lower Manhattan' "Hell's Kitchen" 'Downtown' 'Gulfton'
 'Houston' 'Eagan Park' 'Morrisville' 'Durham' 'Farmington Woods'
 'Whitebridge' 'Lake Wellingborough' 'Fayetteville Street' 'Raleigh'
 'Hazelwood' 'Fairmont' 'Meredith Townes' 'Apex' 'Chapel Hill'
 'Northwoods' 'Edgehill Farms' 'Tanglewood' 'Preston' 'Eastgate'
 'East Elmhurst' 'Jackson Heights' 'Long Island City' 'Katunayaka'
 'Unknown Location' 'Colombo' 'Nugegoda' 'Islamabad' 'R?walpindi'
 'Noorpur Shahan' 'Heritage Pines' 'Westpark Place' 'Waverly Place'
 'Wayne Ridge' 'Weston' 'East Austin' 'West University' 'South Congress'
 'The Drag' 'Congress Ave District' 'Red River District' 'Georgian Acres'
 'North Austin' 'Coxville' 'Convention Center District' 'Austin' 'Katy'
 'Sharpstown' 'Sugar Land' 'Galveston' 'Port Bolivar' 'Washington Avenue'
 'Briar Meadow' 'Latta' 'Jacksonville'

In [22]:
uber_data['START*'].value_counts()

START*
Cary                201
Unknown Location    148
Morrisville          85
Whitebridge          68
Islamabad            57
                   ... 
Florence              1
Ridgeland             1
Daytona Beach         1
Sky Lake              1
Gampaha               1
Name: count, Length: 177, dtype: int64

In [24]:
a=uber_data[uber_data['MILES*']>6]
print(a)

           START_DATE*         END_DATE* CATEGORY*            START*  \
4       1/6/2016 14:42    1/6/2016 15:49  Business       Fort Pierce   
6       1/6/2016 17:30    1/6/2016 17:35  Business   West Palm Beach   
8       1/10/2016 8:05    1/10/2016 8:25  Business              Cary   
9      1/10/2016 12:17   1/10/2016 12:44  Business           Jamaica   
10     1/10/2016 15:08   1/10/2016 15:51  Business          New York   
...                ...               ...       ...               ...   
1144  12/29/2016 23:14  12/29/2016 23:47  Business  Unknown Location   
1152  12/31/2016 15:03  12/31/2016 15:38  Business  Unknown Location   
1153  12/31/2016 21:32  12/31/2016 21:50  Business        Katunayake   
1154  12/31/2016 22:08  12/31/2016 23:51  Business           Gampaha   
1155            Totals               NaN       NaN               NaN   

                 STOP*   MILES*        PURPOSE*  
4      West Palm Beach     63.7  Customer Visit  
6           Palm Beach      7.1    

In [25]:
b = uber_data[(uber_data['MILES*'] >= 50) & (uber_data['MILES*'] <= 100)]
print(b)


          START_DATE*         END_DATE* CATEGORY*            START*  \
4      1/6/2016 14:42    1/6/2016 15:49  Business       Fort Pierce   
251   3/19/2016 19:33   3/19/2016 20:39  Business         Galveston   
295    4/2/2016 12:21    4/2/2016 14:47  Business         Kissimmee   
296    4/2/2016 16:57    4/2/2016 18:09  Business     Daytona Beach   
707   8/24/2016 13:01   8/24/2016 15:25  Business  Unknown Location   
710   8/25/2016 17:19   8/25/2016 19:20  Business  Unknown Location   
726   8/27/2016 14:01   8/27/2016 15:44  Business            Lahore   
751    9/6/2016 17:49    9/6/2016 17:49  Business  Unknown Location   
871  10/28/2016 20:13  10/28/2016 22:00  Business         Asheville   
873  10/29/2016 17:13  10/29/2016 19:19  Business        Hayesville   
880  10/30/2016 13:24  10/30/2016 14:37  Business       Bryson City   

                STOP*  MILES*        PURPOSE*  
4     West Palm Beach    63.7  Customer Visit  
251           Houston    57.0  Customer Visit  
295

In [26]:
c = uber_data[uber_data['MILES*'] >= 50]['MILES*']
print(c)


4          63.7
232       136.0
251        57.0
268       144.0
269       310.3
270       201.0
295        77.3
296        80.5
297       174.2
298       144.0
299       159.3
546       195.3
559       180.2
707        96.2
710        50.4
726        86.6
727       156.9
751        69.1
776       195.6
788       112.6
869       107.0
870       133.6
871        91.8
873        75.7
880        68.4
881       195.9
1088      103.0
1155    12204.7
Name: MILES*, dtype: float64
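
The last value in this output (row 1155, 12204.7) belongs to the `Totals` row visible in the earlier filter output, not to an actual trip. A possible way to keep such a row out of a filter is sketched below on an invented stand-in frame; `Series.between` is inclusive at both ends by default.

```python
import pandas as pd

# Invented stand-in for the trips table, ending with a summary row.
trips = pd.DataFrame({
    "START_DATE*": ["1/6/2016 14:42", "3/25/2016 16:52", "1/1/2016 21:11", "Totals"],
    "MILES*": [63.7, 310.3, 5.1, 379.1],
})

# Mask that is False for the summary row.
is_trip = trips["START_DATE*"] != "Totals"

# Trips of at least 50 miles, summary row excluded.
long_trips = trips.loc[is_trip & (trips["MILES*"] >= 50), "MILES*"]

# Trips between 50 and 100 miles inclusive, summary row excluded.
mid_trips = trips.loc[is_trip & trips["MILES*"].between(50, 100), "MILES*"]

print(long_trips)
print(mid_trips)
```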


In [28]:
d = uber_data[uber_data['MILES*'] >= 50][['MILES*', 'START*', 'STOP*']]
print(d)


       MILES*            START*             STOP*
4        63.7       Fort Pierce   West Palm Beach
232     136.0            Austin              Katy
251      57.0         Galveston           Houston
268     144.0              Cary             Latta
269     310.3             Latta      Jacksonville
270     201.0      Jacksonville         Kissimmee
295      77.3         Kissimmee     Daytona Beach
296      80.5     Daytona Beach      Jacksonville
297     174.2      Jacksonville         Ridgeland
298     144.0         Ridgeland          Florence
299     159.3          Florence              Cary
546     195.3       Morrisville        Banner Elk
559     180.2             Boone              Cary
707      96.2  Unknown Location  Unknown Location
710      50.4  Unknown Location  Unknown Location
726      86.6            Lahore  Unknown Location
727     156.9  Unknown Location  Unknown Location
751      69.1  Unknown Location  Unknown Location
776     195.6  Unknown Location  Unknown Location


In [29]:
# Step 1: Find the unique cities in the 'START*' column
unique_cities = uber_data['START*'].unique()

# Step 2: Select the first 3 unique cities (or any 3 cities you want)
cities_to_select = unique_cities[:3]  # Select first 3 unique cities (change this as needed)

# Step 3: Filter the data based on these 3 cities
filtered_data = uber_data[uber_data['START*'].isin(cities_to_select)]

# Step 4: Print the 'START*' and related columns (if you want 'START*' and 'STOP*')
print(filtered_data[['START*', 'STOP*']])


           START*            STOP*
0     Fort Pierce      Fort Pierce
1     Fort Pierce      Fort Pierce
2     Fort Pierce      Fort Pierce
3     Fort Pierce      Fort Pierce
4     Fort Pierce  West Palm Beach
...           ...              ...
1049         Cary             Cary
1050         Cary             Cary
1051         Cary             Cary
1052         Cary      Morrisville
1054         Cary      Morrisville

[208 rows x 2 columns]


In [31]:
uber_data.loc[uber_data['START*'].isin(['New York', 'Cary', 'Colombo'])]


Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
7,1/7/2016 13:27,1/7/2016 13:33,Business,Cary,Cary,0.8,Meeting
8,1/10/2016 8:05,1/10/2016 8:25,Business,Cary,Morrisville,8.3,Meeting
10,1/10/2016 15:08,1/10/2016 15:51,Business,New York,Queens,10.8,Meeting
22,1/12/2016 16:02,1/12/2016 17:00,Business,New York,Queens County,15.1,Meeting
28,1/15/2016 11:43,1/15/2016 12:03,Business,Cary,Durham,10.4,Meal/Entertain
...,...,...,...,...,...,...,...
1049,12/13/2016 20:20,12/13/2016 20:29,Business,Cary,Cary,4.1,Meal/Entertain
1050,12/14/2016 16:52,12/14/2016 17:10,Business,Cary,Cary,3.4,
1051,12/14/2016 17:22,12/14/2016 17:34,Business,Cary,Cary,3.3,
1052,12/14/2016 17:50,12/14/2016 18:00,Business,Cary,Morrisville,3.0,Meal/Entertain


In [36]:
filter_data = uber_data[(uber_data['START*'].isin(['New York', 'Cary', 'Colombo'])) & 
                        (uber_data['STOP*'].isin(['Queens', 'Cary', 'Morrisville'])) & 
                        (uber_data['MILES*'] >= 10) &(uber_data['MILES*'] <= 20)]
print(filter_data)


           START_DATE*         END_DATE* CATEGORY*    START*        STOP*  \
10     1/10/2016 15:08   1/10/2016 15:51  Business  New York       Queens   
982   11/20/2016 17:45  11/20/2016 18:37  Business      Cary         Cary   
990   11/22/2016 15:51  11/22/2016 16:43  Business      Cary         Cary   
1035   12/9/2016 22:03   12/9/2016 22:57  Business      Cary         Cary   
1054  12/15/2016 14:20  12/15/2016 14:54  Business      Cary  Morrisville   

      MILES*         PURPOSE*  
10      10.8          Meeting  
982     18.5  Errand/Supplies  
990     12.7   Customer Visit  
1035    18.9   Customer Visit  
1054    10.6          Meeting  


In [40]:
print(uber_data[uber_data['START_DATE*'].dt.month == 1][['START_DATE*','START*', 'STOP*']])


           START_DATE*       START*            STOP*
0  2016-01-01 21:11:00  Fort Pierce      Fort Pierce
1  2016-01-02 01:25:00  Fort Pierce      Fort Pierce
2  2016-01-02 20:25:00  Fort Pierce      Fort Pierce
3  2016-01-05 17:31:00  Fort Pierce      Fort Pierce
4  2016-01-06 14:42:00  Fort Pierce  West Palm Beach
..                 ...          ...              ...
56 2016-01-29 13:24:00       Durham             Cary
57 2016-01-29 18:31:00         Cary             Apex
58 2016-01-29 21:21:00         Apex             Cary
59 2016-01-30 16:21:00         Cary             Apex
60 2016-01-30 18:09:00         Apex             Cary

[61 rows x 3 columns]


In [45]:
print(uber_data[(uber_data['START_DATE*'].dt.month == 1) &
                uber_data['START*'].isin(['Cary'])][['START_DATE*','START*','STOP*']])


           START_DATE* START*        STOP*
7  2016-01-07 13:27:00   Cary         Cary
8  2016-01-10 08:05:00   Cary  Morrisville
28 2016-01-15 11:43:00   Cary       Durham
30 2016-01-18 14:55:00   Cary         Cary
34 2016-01-20 10:36:00   Cary      Raleigh
37 2016-01-21 14:25:00   Cary         Cary
38 2016-01-21 14:43:00   Cary         Cary
39 2016-01-21 16:01:00   Cary         Cary
43 2016-01-26 17:17:00   Cary         Cary
44 2016-01-26 17:27:00   Cary         Cary
45 2016-01-27 09:24:00   Cary         Cary
46 2016-01-27 10:19:00   Cary      Raleigh
50 2016-01-28 12:28:00   Cary      Raleigh
53 2016-01-29 09:31:00   Cary         Cary
54 2016-01-29 10:56:00   Cary         Cary
55 2016-01-29 11:43:00   Cary       Durham
57 2016-01-29 18:31:00   Cary         Apex
59 2016-01-30 16:21:00   Cary         Apex


In [50]:
filtered_data = uber_data[
    (uber_data['START_DATE*'].dt.month == 1) & 
    (uber_data['END_DATE*'].dt.month == 1) & 
    uber_data['START*'] == 'Cary'
][['START_DATE*', 'START*', 'STOP*']]

print(filtered_data)


Empty DataFrame
Columns: [START_DATE*, START*, STOP*]
Index: []
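
The empty result is consistent with Python's operator precedence: `&` binds more tightly than `==`, so an unparenthesised comparison at the end of a chain is applied to the combined mask rather than to the column alone. The sketch below shows the precedence rule with plain integers and then the parenthesised form on a small invented frame.

```python
import pandas as pd

# Precedence with plain integers: & is evaluated before ==.
without_parens = 1 & 2 == 2   # parsed as (1 & 2) == 2
with_parens = 1 & (2 == 2)    # the comparison is evaluated first

# Invented frame: wrap every comparison in its own parentheses.
df = pd.DataFrame({"month": [1, 1, 2], "START*": ["Cary", "Apex", "Cary"]})
mask = (df["month"] == 1) & (df["START*"] == "Cary")
result = df[mask]

print(without_parens, with_parens)
print(result)
```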


In [9]:
uber_data['START_DATE*'] = pd.to_datetime(uber_data['START_DATE*'], errors='coerce')
uber_data['END_DATE*'] = pd.to_datetime(uber_data['END_DATE*'], errors='coerce')

cleaned_data = uber_data[
    (uber_data['START_DATE*'].dt.month == 1) & (uber_data['END_DATE*'].dt.month == 1) & 
    (uber_data['START*'] == 'Cary')
][['START_DATE*', 'START*', 'STOP*']]

print(cleaned_data)


           START_DATE* START*        STOP*
7  2016-01-07 13:27:00   Cary         Cary
8  2016-01-10 08:05:00   Cary  Morrisville
28 2016-01-15 11:43:00   Cary       Durham
30 2016-01-18 14:55:00   Cary         Cary
34 2016-01-20 10:36:00   Cary      Raleigh
37 2016-01-21 14:25:00   Cary         Cary
38 2016-01-21 14:43:00   Cary         Cary
39 2016-01-21 16:01:00   Cary         Cary
43 2016-01-26 17:17:00   Cary         Cary
44 2016-01-26 17:27:00   Cary         Cary
45 2016-01-27 09:24:00   Cary         Cary
46 2016-01-27 10:19:00   Cary      Raleigh
50 2016-01-28 12:28:00   Cary      Raleigh
53 2016-01-29 09:31:00   Cary         Cary
54 2016-01-29 10:56:00   Cary         Cary
55 2016-01-29 11:43:00   Cary       Durham
57 2016-01-29 18:31:00   Cary         Apex
59 2016-01-30 16:21:00   Cary         Apex


In [10]:
cleaned_data.reset_index(inplace=True, drop=True)
print(cleaned_data)

           START_DATE* START*        STOP*
0  2016-01-07 13:27:00   Cary         Cary
1  2016-01-10 08:05:00   Cary  Morrisville
2  2016-01-15 11:43:00   Cary       Durham
3  2016-01-18 14:55:00   Cary         Cary
4  2016-01-20 10:36:00   Cary      Raleigh
5  2016-01-21 14:25:00   Cary         Cary
6  2016-01-21 14:43:00   Cary         Cary
7  2016-01-21 16:01:00   Cary         Cary
8  2016-01-26 17:17:00   Cary         Cary
9  2016-01-26 17:27:00   Cary         Cary
10 2016-01-27 09:24:00   Cary         Cary
11 2016-01-27 10:19:00   Cary      Raleigh
12 2016-01-28 12:28:00   Cary      Raleigh
13 2016-01-29 09:31:00   Cary         Cary
14 2016-01-29 10:56:00   Cary         Cary
15 2016-01-29 11:43:00   Cary       Durham
16 2016-01-29 18:31:00   Cary         Apex
17 2016-01-30 16:21:00   Cary         Apex


In [11]:
uber_data.sort_values(by='MILES*')


Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
420,2016-06-08 17:16:00,2016-06-08 17:18:00,Business,Soho,Tribeca,0.5,Errand/Supplies
44,2016-01-26 17:27:00,2016-01-26 17:29:00,Business,Cary,Cary,0.5,Errand/Supplies
120,2016-02-17 16:38:00,2016-02-17 16:43:00,Business,Katunayaka,Katunayaka,0.5,Errand/Supplies
1111,2016-12-25 00:10:00,2016-12-25 00:14:00,Business,Lahore,Lahore,0.6,Errand/Supplies
1110,2016-12-24 22:04:00,2016-12-24 22:09:00,Business,Lahore,Lahore,0.6,Errand/Supplies
...,...,...,...,...,...,...,...
776,2016-09-27 21:01:00,2016-09-28 02:37:00,Business,Unknown Location,Unknown Location,195.6,
881,2016-10-30 15:22:00,2016-10-30 18:23:00,Business,Asheville,Mebane,195.9,
270,2016-03-25 22:54:00,2016-03-26 01:39:00,Business,Jacksonville,Kissimmee,201.0,Meeting
269,2016-03-25 16:52:00,2016-03-25 22:22:00,Business,Latta,Jacksonville,310.3,Customer Visit


In [12]:
uber_data.sort_values(by='MILES*',ascending=False)

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
1155,NaT,NaT,,,,12204.7,
269,2016-03-25 16:52:00,2016-03-25 22:22:00,Business,Latta,Jacksonville,310.3,Customer Visit
270,2016-03-25 22:54:00,2016-03-26 01:39:00,Business,Jacksonville,Kissimmee,201.0,Meeting
881,2016-10-30 15:22:00,2016-10-30 18:23:00,Business,Asheville,Mebane,195.9,
776,2016-09-27 21:01:00,2016-09-28 02:37:00,Business,Unknown Location,Unknown Location,195.6,
...,...,...,...,...,...,...,...
1121,2016-12-27 12:53:00,2016-12-27 12:57:00,Business,Kar?chi,Kar?chi,0.6,Meal/Entertain
1110,2016-12-24 22:04:00,2016-12-24 22:09:00,Business,Lahore,Lahore,0.6,Errand/Supplies
44,2016-01-26 17:27:00,2016-01-26 17:29:00,Business,Cary,Cary,0.5,Errand/Supplies
420,2016-06-08 17:16:00,2016-06-08 17:18:00,Business,Soho,Tribeca,0.5,Errand/Supplies


In [14]:
uber_data.sort_values(by=['START*','MILES*'],ascending=[True,False])

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
906,2016-11-04 21:04:00,2016-11-04 21:20:00,Business,Agnew,Cory,4.3,
911,2016-11-06 10:50:00,2016-11-06 11:04:00,Business,Agnew,Renaissance,2.4,
908,2016-11-05 08:34:00,2016-11-05 08:43:00,Business,Agnew,Renaissance,2.2,
910,2016-11-05 19:20:00,2016-11-05 19:28:00,Business,Agnew,Agnew,2.2,
879,2016-10-30 12:58:00,2016-10-30 13:18:00,Business,Almond,Bryson City,15.2,
...,...,...,...,...,...,...,...
889,2016-11-01 17:35:00,2016-11-01 17:42:00,Business,Whitebridge,Whitebridge,1.2,
890,2016-11-01 19:14:00,2016-11-01 19:20:00,Business,Whitebridge,Whitebridge,1.0,
516,2016-07-05 16:48:00,2016-07-05 16:52:00,Business,Whitebridge,Whitebridge,0.6,Errand/Supplies
870,2016-10-28 18:13:00,2016-10-28 20:07:00,Business,Winston Salem,Asheville,133.6,Meeting


In [23]:
import numpy as np
uber_data["MILES_CAT"]=np.where(uber_data['MILES*']>100,"Long trip","Short trip")
uber_data.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,MILES_CAT,nc
0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,Short trip,10
1,2016-01-02 01:25:00,2016-01-02 01:37:00,Business,Fort Pierce,Fort Pierce,5.0,,Short trip,10
2,2016-01-02 20:25:00,2016-01-02 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,Short trip,10
3,2016-01-05 17:31:00,2016-01-05 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting,Short trip,10
4,2016-01-06 14:42:00,2016-01-06 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,Short trip,10


In [28]:
uber_data['nc']=10
uber_data.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,MILES_CAT,nc,TRIP
0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,Short trip,10,Short trip
1,2016-01-02 01:25:00,2016-01-02 01:37:00,Business,Fort Pierce,Fort Pierce,5.0,,Short trip,10,Short trip
2,2016-01-02 20:25:00,2016-01-02 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,Short trip,10,Short trip
3,2016-01-05 17:31:00,2016-01-05 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting,Short trip,10,Short trip
4,2016-01-06 14:42:00,2016-01-06 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,Short trip,10,Short trip


In [27]:
import numpy as np
uber_data["TRIP"] = np.where(uber_data['MILES*'] <= 100, "Short trip", np.where(uber_data['MILES*'] <= 200,"Medium trip","Long trip"))
uber_data

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,MILES_CAT,nc,TRIP
0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,Short trip,10,Short trip
1,2016-01-02 01:25:00,2016-01-02 01:37:00,Business,Fort Pierce,Fort Pierce,5.0,,Short trip,10,Short trip
2,2016-01-02 20:25:00,2016-01-02 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,Short trip,10,Short trip
3,2016-01-05 17:31:00,2016-01-05 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting,Short trip,10,Short trip
4,2016-01-06 14:42:00,2016-01-06 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,Short trip,10,Short trip
...,...,...,...,...,...,...,...,...,...,...
1151,2016-12-31 13:24:00,2016-12-31 13:42:00,Business,Kar?chi,Unknown Location,3.9,Temporary Site,Short trip,10,Short trip
1152,2016-12-31 15:03:00,2016-12-31 15:38:00,Business,Unknown Location,Unknown Location,16.2,Meeting,Short trip,10,Short trip
1153,2016-12-31 21:32:00,2016-12-31 21:50:00,Business,Katunayake,Gampaha,6.4,Temporary Site,Short trip,10,Short trip
1154,2016-12-31 22:08:00,2016-12-31 23:51:00,Business,Gampaha,Ilukwatta,48.2,Temporary Site,Short trip,10,Short trip
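
Nested `np.where` calls become hard to read as the number of categories grows. As an alternative sketch (not the approach used in the cell above), `pd.cut` can express the same thresholds as bin edges; the `miles` values here are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative distances, including values exactly on the thresholds.
miles = pd.Series([5.1, 100.0, 136.0, 200.0, 310.3])

# Nested np.where, as in the cell above.
nested = np.where(
    miles <= 100, "Short trip",
    np.where(miles <= 200, "Medium trip", "Long trip"),
)

# Same thresholds as right-inclusive bins: (0, 100], (100, 200], (200, inf].
binned = pd.cut(
    miles,
    bins=[0, 100, 200, np.inf],
    labels=["Short trip", "Medium trip", "Long trip"],
)

print(binned)
```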


In [29]:
uber_data.drop(columns=['MILES_CAT'])


Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,nc,TRIP
0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,10,Short trip
1,2016-01-02 01:25:00,2016-01-02 01:37:00,Business,Fort Pierce,Fort Pierce,5.0,,10,Short trip
2,2016-01-02 20:25:00,2016-01-02 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,10,Short trip
3,2016-01-05 17:31:00,2016-01-05 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting,10,Short trip
4,2016-01-06 14:42:00,2016-01-06 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,10,Short trip
...,...,...,...,...,...,...,...,...,...
1151,2016-12-31 13:24:00,2016-12-31 13:42:00,Business,Kar?chi,Unknown Location,3.9,Temporary Site,10,Short trip
1152,2016-12-31 15:03:00,2016-12-31 15:38:00,Business,Unknown Location,Unknown Location,16.2,Meeting,10,Short trip
1153,2016-12-31 21:32:00,2016-12-31 21:50:00,Business,Katunayake,Gampaha,6.4,Temporary Site,10,Short trip
1154,2016-12-31 22:08:00,2016-12-31 23:51:00,Business,Gampaha,Ilukwatta,48.2,Temporary Site,10,Short trip


In [30]:
uber_data.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,MILES_CAT,nc,TRIP
0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,Short trip,10,Short trip
1,2016-01-02 01:25:00,2016-01-02 01:37:00,Business,Fort Pierce,Fort Pierce,5.0,,Short trip,10,Short trip
2,2016-01-02 20:25:00,2016-01-02 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,Short trip,10,Short trip
3,2016-01-05 17:31:00,2016-01-05 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting,Short trip,10,Short trip
4,2016-01-06 14:42:00,2016-01-06 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,Short trip,10,Short trip
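
`MILES_CAT` is still present in this `head()` output because `drop` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch with an invented two-row frame:

```python
import pandas as pd

# Invented two-row frame with a column to remove.
original = pd.DataFrame({
    "MILES*": [5.1, 63.7],
    "MILES_CAT": ["Short trip", "Short trip"],
})

# drop() returns a new DataFrame; `original` keeps all of its columns.
without_cat = original.drop(columns=["MILES_CAT"])

print(list(original.columns))
print(list(without_cat.columns))
```

To make the removal stick, assign the result back to the same name, for example `original = original.drop(columns=["MILES_CAT"])`.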


In [31]:
total_trips= uber_data['TRIP'].value_counts()
print(total_trips)

TRIP
Short trip     1139
Medium trip      14
Long trip         3
Name: count, dtype: int64


In [33]:
uber_data.groupby('START*')['MILES*'].agg('mean')

START*
Agnew                2.775000
Almond              15.200000
Apex                 5.341176
Arabi               17.000000
Arlington            4.900000
                      ...    
West University      2.200000
Weston               4.000000
Westpark Place       2.182353
Whitebridge          4.020588
Winston Salem      133.600000
Name: MILES*, Length: 177, dtype: float64

In [34]:
uber_data.groupby('PURPOSE*')['MILES*'].agg('mean')

PURPOSE*
Airport/Travel       5.500000
Between Offices     10.944444
Charity ($)         15.100000
Commute            180.200000
Customer Visit      20.688119
Errand/Supplies      3.968750
Meal/Entertain       5.698125
Meeting             15.247594
Moving               4.550000
Temporary Site      10.474000
Name: MILES*, dtype: float64

In [37]:
grouped = uber_data.groupby('CATEGORY*')['MILES*'].agg(['sum', 'mean', 'max'])
print(grouped)

               sum       mean    max
CATEGORY*                           
Business   11487.0  10.655844  310.3
Personal     717.7   9.320779  180.2
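
When several statistics are computed at once, named aggregation gives the result columns more descriptive names than `sum`, `mean`, and `max`. The sketch below uses an invented three-row frame, and the output column names are arbitrary choices.

```python
import pandas as pd

# Invented trips; only the column names follow the dataset.
trips = pd.DataFrame({
    "CATEGORY*": ["Business", "Business", "Personal"],
    "MILES*": [5.0, 15.0, 8.0],
})

# Named aggregation: keyword = output column name, value = aggregation.
summary = trips.groupby("CATEGORY*")["MILES*"].agg(
    total_miles="sum",
    mean_miles="mean",
    longest_trip="max",
)

print(summary)
```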
