
# Comprehensive Tutorial on the Python Pandas Library

## What is Pandas?

Pandas is an open-source Python library for data manipulation and analysis. **Wes McKinney** began developing it in 2008 while working at AQR Capital Management. It has since become one of the most widely used tools in data science for handling and analyzing structured data.

Pandas provides:
- Powerful tools for data cleaning, manipulation, and analysis.
- Easy handling of missing data.
- Support for data from various formats (CSV, Excel, SQL, etc.).

In this tutorial, you'll learn the basics of Pandas, explore its key features, and practice working with an example dataset. By the end, you'll be equipped with the knowledge to perform essential data operations.

---

## Installation

You can install Pandas using either `pip` or `conda`:

Using `pip`:
```bash
pip install pandas
```

Using `conda`:
```bash
conda install pandas
```

Make sure you have Python installed before running these commands.
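
Once the install finishes, a quick way to confirm it worked is to import the library and print its version. This is only a minimal check; the version string you see depends on your environment.

```python
import pandas as pd

# Print the installed pandas version; the exact value varies by environment.
print(pd.__version__)
```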

---



## Loading Data

To begin working with a dataset in Pandas, you first load it into a DataFrame. A DataFrame is a two-dimensional tabular data structure with labeled rows and columns.
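
To make the idea of labeled rows and columns concrete, here is a minimal sketch that builds a small DataFrame by hand. The column names mirror the Uber dataset used below, but the values and the row labels (`trip_a`, `trip_b`, `trip_c`) are made up for illustration.

```python
import pandas as pd

# A tiny, hand-built DataFrame; values and row labels are illustrative only
trips = pd.DataFrame(
    {
        "CATEGORY": ["Business", "Business", "Personal"],
        "MILES": [5.1, 5.0, 12.3],
    },
    index=["trip_a", "trip_b", "trip_c"],  # hypothetical row labels
)

print(trips.shape)                    # (rows, columns)
print(list(trips.columns))            # column labels
print(trips.loc["trip_b", "MILES"])   # look up one cell by row and column label
```

Each cell can be addressed by its row label and column label, which is what distinguishes a DataFrame from a plain two-dimensional array.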

### Example: Loading a CSV File

We'll use the `read_csv()` function to load the provided dataset (`UberDataset.csv`). This dataset contains information about Uber trips, including start and end dates, trip categories, and mileage.

The syntax is:
```python
df = pd.read_csv('file_path.csv')
```

Let's load the dataset:


```python
import pandas as pd

# Load the Uber dataset
file_path = 'UberDataset.csv'
df = pd.read_csv(file_path)

# Display the first few rows
df.head()
```


|   | START_DATE | END_DATE | CATEGORY | START | STOP | MILES | PURPOSE |
|---|---|---|---|---|---|---|---|
| 0 | 01-01-2016 21:11 | 01-01-2016 21:17 | Business | Fort Pierce | Fort Pierce | 5.1 | Meal/Entertain |
| 1 | 01-02-2016 01:25 | 01-02-2016 01:37 | Business | Fort Pierce | Fort Pierce | 5.0 | NaN |
| 2 | 01-02-2016 20:25 | 01-02-2016 20:38 | Business | Fort Pierce | Fort Pierce | 4.8 | Errand/Supplies |
| 3 | 01-05-2016 17:31 | 01-05-2016 17:45 | Business | Fort Pierce | Fort Pierce | 4.7 | Meeting |
| 4 | 01-06-2016 14:42 | 01-06-2016 15:49 | Business | Fort Pierce | West Palm Beach | 63.7 | Customer Visit |
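
Notice that the second row in the output above has no value for `PURPOSE`. As a rough sketch of how `read_csv()` treats such gaps, the snippet below parses two rows from an in-memory string rather than from a file. Using `io.StringIO` is an assumption made here so the example runs without `UberDataset.csv`; the variable names `csv_text` and `sample` are illustrative.

```python
import io

import pandas as pd

# Two rows copied from the output above; the second has an empty PURPOSE field
csv_text = (
    "START_DATE,END_DATE,CATEGORY,START,STOP,MILES,PURPOSE\n"
    "01-01-2016 21:11,01-01-2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain\n"
    "01-02-2016 01:25,01-02-2016 01:37,Business,Fort Pierce,Fort Pierce,5.0,\n"
)

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(io.StringIO(csv_text))

# By default, the empty field is read as a missing value (NaN)
print(sample["PURPOSE"].isnull().tolist())
```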



## Inspecting Data

After loading data, it's important to inspect it to understand its structure and contents. Pandas provides several methods for this:

- `df.head()`: View the first 5 rows of the dataset (or specify the number of rows, e.g., `df.head(10)`).
- `df.tail()`: View the last 5 rows of the dataset.
- `df.info()`: Summary of the dataset, including the number of non-null entries and data types.
- `df.describe()`: Statistical summary of numerical columns.
- `df.shape`: Dimensions of the dataset (rows, columns).
- `df.columns`: Column labels of the dataset.
- `df.dtypes`: Data types of each column.
- `df.isnull().sum()`: Number of missing values in each column.

### Example:


```python
# Display first and last rows
print("First 5 rows:")
print(df.head())

print("\nLast 5 rows:")
print(df.tail())
```

```text
First 5 rows:
         START_DATE          END_DATE  CATEGORY        START             STOP  \
0  01-01-2016 21:11  01-01-2016 21:17  Business  Fort Pierce      Fort Pierce
1  01-02-2016 01:25  01-02-2016 01:37  Business  Fort Pierce      Fort Pierce
2  01-02-2016 20:25  01-02-2016 20:38  Business  Fort Pierce      Fort Pierce
3  01-05-2016 17:31  01-05-2016 17:45  Business  Fort Pierce      Fort Pierce
4  01-06-2016 14:42  01-06-2016 15:49  Business  Fort Pierce  West Palm Beach

   MILES          PURPOSE
0    5.1   Meal/Entertain
1    5.0              NaN
2    4.8  Errand/Supplies
3    4.7          Meeting
4   63.7   Customer Visit

Last 5 rows:
            START_DATE          END_DATE  CATEGORY             START  \
1151  12/31/2016 13:24  12/31/2016 13:42  Business           Kar?chi
1152  12/31/2016 15:03  12/31/2016 15:38  Business  Unknown Location
1153  12/31/2016 21:32  12/31/2016 21:50  Business        Katunayake
1154  12/31/2016 22:08  12/31
```

```python
# Dataset information and statistics
print("\nDataset Info:")
# info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

print("\nStatistical Summary:")
print(df.describe())
```

```text
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   START_DATE  1156 non-null   object
 1   END_DATE    1155 non-null   object
 2   CATEGORY    1155 non-null   object
 3   START       1155 non-null   object
 4   STOP        1155 non-null   object
 5   MILES       1156 non-null   float64
 6   PURPOSE     653 non-null    object
dtypes: float64(1), object(6)
memory usage: 63.3+ KB

Statistical Summary:
              MILES
count   1156.000000
mean      21.115398
std      359.299007
min        0.500000
25%        2.900000
50%        6.000000
75%       10.400000
max    12204.700000
```
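
The summary shows a maximum of 12204.7 miles against a 75th percentile of 10.4, which suggests at least one extreme value worth inspecting before relying on the mean. One way to find such rows is `nlargest()`. The sketch below uses a small made-up DataFrame named `demo` (its values are illustrative, not taken from the dataset) so that it runs on its own; with the real data you would call the same methods on `df`.

```python
import pandas as pd

# Illustrative stand-in for the MILES column, including one extreme value
demo = pd.DataFrame({"MILES": [5.1, 5.0, 4.8, 63.7, 1000.0]})

# Rows with the two largest MILES values, largest first
top = demo.nlargest(2, "MILES")
print(top)

# The median is much less sensitive to extreme values than the mean
print(demo["MILES"].mean(), demo["MILES"].median())
```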


```python
# Dataset dimensions and column information
print("\nShape of the dataset:")
print(df.shape)

print("\nColumn names:")
print(df.columns)

print("\nData types of each column:")
print(df.dtypes)

print("\nMissing values in each column:")
print(df.isnull().sum())
```


```text
Shape of the dataset:
(1156, 7)

Column names:
Index(['START_DATE', 'END_DATE', 'CATEGORY', 'START', 'STOP', 'MILES',
       'PURPOSE'],
      dtype='object')

Data types of each column:
START_DATE     object
END_DATE       object
CATEGORY       object
START          object
STOP           object
MILES         float64
PURPOSE        object
dtype: object

Missing values in each column:
START_DATE      0
END_DATE        1
CATEGORY        1
START           1
STOP            1
MILES           0
PURPOSE       503
dtype: int64
```
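
Raw counts are easier to judge as a share of all rows: here `PURPOSE` is missing in 503 of 1156 rows, or about 44%. As a sketch, `isnull().mean()` gives that fraction per column, and a boolean mask selects the rows where a given column is missing. The `demo` DataFrame below is a made-up stand-in so the example is self-contained; on the real data the same expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing PURPOSE values
demo = pd.DataFrame(
    {
        "MILES": [5.1, 5.0, 4.8, 4.7],
        "PURPOSE": ["Meal/Entertain", np.nan, "Errand/Supplies", np.nan],
    }
)

# Fraction of missing values per column (the mean of a True/False mask)
missing_share = demo.isnull().mean()
print(missing_share)

# Rows where PURPOSE is missing
rows_missing_purpose = demo[demo["PURPOSE"].isnull()]
print(rows_missing_purpose)
```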
