# Pandas Tutorial for Beginners

Welcome to this comprehensive tutorial on **Pandas**. This notebook is designed to teach you the basics of the Pandas library in Python. We will go through concepts slowly, repeat key ideas to ensure they stick, and practice with exercises.

## Table of Contents
1. Introduction to Pandas
2. Getting Started (Installation & Import)
3. Pandas Series
4. Pandas DataFrames
5. Reading CSV Files
6. Analyzing Data
7. Summary

## 1. Introduction to Pandas

### What is Pandas?
Pandas is a Python library used for working with data sets. It allows us to analyze, clean, explore, and manipulate data.

The name "Pandas" references both "Panel Data" and "Python Data Analysis". It was created by Wes McKinney in 2008.

### Why Use Pandas?
* **Big Data Analysis:** It helps you analyze large data sets and draw conclusions based on statistical theory.
* **Data Cleaning:** It can clean messy data sets (removing empty or wrong values), making them readable and relevant.
* **Data Science:** It is a fundamental tool in Data Science for deriving information from raw data.

### What Can Pandas Do?
Pandas can answer questions about your data, such as:
* What is the average, minimum, or maximum value?
* Is there a correlation between columns?

Pandas can also clean your data by deleting rows that are irrelevant or contain errors.
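
As a rough illustration of the points above, here is a minimal sketch. The table, its column names, and its values are invented for this example, and importing Pandas is explained in the next section.

```python
import pandas as pd

# Hypothetical workout data, invented for illustration; one Calories value is missing
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45],
    "Calories": [409.1, 479.0, None, 282.4],
})

# Average, minimum, and maximum of one column (missing values are skipped)
print(df["Calories"].mean())
print(df["Calories"].min())
print(df["Calories"].max())

# Correlation between the numeric columns
print(df.corr())

# Cleaning: drop every row that contains a missing value
clean_df = df.dropna()
print(clean_df)
```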

**Recap:** Pandas is *the* tool in Python for making sense of data, whether you need to calculate statistics, clean up messy files, or prepare data for machine learning.

## 2. Getting Started

### Installation
If you have Python and pip installed, you can install Pandas from the command line:

`pip install pandas`

(Note: If you are using Anaconda, it is likely already installed.)

### Importing Pandas
To use Pandas in your Python code, you must import it. We usually import it under the **alias** `pd`.

**Why use an alias?** An alias is an alternate name. Using `pd` saves us from typing `pandas` every time we want to call a function. It is a standard convention used by almost everyone.

In [1]:
import pandas as pd

# Check the version of Pandas to ensure it is working
print("Pandas version:", pd.__version__)

Pandas version: 2.3.3


**Explanation of the code:**
1. `import pandas as pd`: This brings the Pandas library into our environment and renames it `pd`.
2. `pd.__version__`: This attribute holds the version number of the library.

## 3. Pandas Series

### What is a Series?
A Pandas **Series** is like a column in a table. It is a one-dimensional array that can hold data of any type (integers, strings, floating-point numbers, Python objects, etc.).

Think of it as a simple list of items, but with an index (labels) attached to each item.

### Creating a Series
You can create a Series from a Python list.

In [2]:
import pandas as pd

a = [1, 7, 2]
my_series = pd.Series(a)

print(my_series)

0    1
1    7
2    2
dtype: int64


In [5]:
type(my_series)

pandas.core.series.Series

### Labels (Indexes)
Notice the output above has two columns. The right column is your data `[1, 7, 2]`. The left column `[0, 1, 2]` is the **Index**.

If you don't specify an index, Pandas assigns a default one starting from 0.

You can access items using this index:

In [7]:
print(my_series[1])  # Label 1 is the second item (7), because labels start at 0

7
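
Because the default labels start at 0, it is easy to mix up a label with a position. The short sketch below separates the two ideas using `.iloc` (position) and `.loc` (label); the variable names `first_item` and `second_item` are just for this example.

```python
import pandas as pd

my_series = pd.Series([1, 7, 2])

# With the default index, labels and positions happen to be the same numbers
first_item = my_series.iloc[0]   # position 0 -> the first item
second_item = my_series.loc[1]   # label 1 -> the second item

print(first_item)
print(second_item)
```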


### Creating Custom Labels
You can name your own labels using the `index` argument. This makes the Series work almost like a dictionary.

In [8]:
a = [1, 7, 2]
my_series = pd.Series(a, index = ["x", "y", "z"])

print(my_series)
#print("Value at y:", my_series["y"])

x    1
y    7
z    2
dtype: int64
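
To see what "almost like a dictionary" means in practice, here is a small sketch using the same labels as above. It only shows a few dictionary-like operations and is not a complete list.

```python
import pandas as pd

my_series = pd.Series([1, 7, 2], index=["x", "y", "z"])

print(my_series["y"])        # look up a value by its label
print("y" in my_series)      # membership test checks the labels, like dict keys
print(my_series.to_dict())   # convert the Series back to a plain dictionary
```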


### Key/Value Objects (Dictionaries) as Series
You can also create a Series directly from a dictionary. The keys of the dictionary become the labels (index).

In [9]:
calories = {"day1": 420, "day2": 380, "day3": 390}
my_series = pd.Series(calories)

print(my_series)

day1    420
day2    380
day3    390
dtype: int64
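
If you only want some of the dictionary's entries, you can combine a dictionary with the `index` argument. The following sketch assumes the same `calories` dictionary as above; the name `subset` is only used for this example.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys listed in `index` are included in the Series
subset = pd.Series(calories, index=["day1", "day2"])

print(subset)
```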


--- 
### 🟢 Exercise 1: Series
1. Create a list named `scores` with values: `[10, 20, 30]`.
2. Create a Pandas Series from this list.
3. Assign custom labels `["math", "science", "history"]` to it.
4. Print the score for "science".

In [21]:
# WRITE YOUR CODE HERE
scores = [10, 20, 30]
my_series = pd.Series(scores, index = ["math", "science", "history"])
print(my_series["science"])  # Use the label; positional access like my_series[1] is deprecated here

20


## 4. Pandas DataFrames

### What is a DataFrame?
A Pandas **DataFrame** is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.

* **Series** is like a column.
* **DataFrame** is the whole table.

### Creating a DataFrame
We often create DataFrames from a dictionary of lists.

In [17]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)

   calories  duration
0       420        50
1       380        40
2       390        45
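
To connect the two ideas (a Series is a column, a DataFrame is the whole table), this short sketch pulls one column out of the DataFrame and checks its type. The name `calories_col` is only used here.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

calories_col = df["calories"]   # selecting a single column gives a Series

print(type(calories_col))
print(df.shape)                 # (number of rows, number of columns)
```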


### Locating Rows
Pandas uses the `loc` attribute to return one or more specified rows.

1. `df.loc[0]`: Returns a Series representing the row at index 0.
2. `df.loc[[0, 1]]`: Returns a new DataFrame with rows 0 and 1.

In [11]:
# Return row 0
print(df.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


In [18]:
print(df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40
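
`loc` also accepts a row label together with a column name, and it accepts slices of labels. The sketch below shows both on the same data; note that, unlike ordinary Python slicing, a `loc` slice includes its end label. The names `single_value` and `sliced` are only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

# Row label and column name together return a single value
single_value = df.loc[0, "calories"]
print(single_value)

# A label slice includes BOTH endpoints, so this returns rows 0 and 1
sliced = df.loc[0:1]
print(sliced)
```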


### Named Indexes in DataFrames
Just like Series, you can name the indexes in a DataFrame using the `index` argument.

Once named, you can use `loc` with the name to find the row.

In [19]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45


In [20]:
# Refer to the named index:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64
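
Once the index has names, `loc` expects those names. If you still want to pick a row by its position, `iloc` does that. Here is a minimal sketch on the same data; `by_label` and `by_position` are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

# A list of labels returns a DataFrame with just those rows
by_label = df.loc[["day1", "day3"]]
print(by_label)

# iloc ignores the labels and counts positions from 0
by_position = df.iloc[1]
print(by_position)
```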


--- 
### 🟢 Exercise 2: DataFrames
1. Create a dictionary called `car_data` with two keys: `"brand"` (["Ford", "BMW", "Fiat"]) and `"year"` ([2010, 2015, 2020]).
2. Create a DataFrame `cars_df` from this dictionary.
3. Use `loc` to print the first row (index 0).

In [26]:
# WRITE YOUR CODE HERE
car_data = {
    "brand": ["Ford", "BMW", "Fiat"],
    "year": [2010, 2015, 2020]
}
cars_df = pd.DataFrame(car_data)  # Use the variable name the exercise asks for
print(cars_df.loc[0])

brand    Ford
year     2010
Name: 0, dtype: object


## 5. Reading CSV Files

A simple way to store big data sets is to use CSV (comma-separated values) files. Pandas is excellent at reading these.

### Creating a Dummy CSV
Since we are in a notebook, let's first create a CSV file named `data.csv` so we can practice reading it.

In [27]:
# Run this cell to create the file 'data.csv' in your environment
csv_content = """
Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
45,109,175,282.4
45,117,148,406.0
60,102,127,300.0
"""

with open("data.csv", "w") as f:
    f.write(csv_content.strip())

print("data.csv created successfully!")

data.csv created successfully!


### Loading the CSV
Use `pd.read_csv('filename')` to load the data into a DataFrame.

In [28]:
df = pd.read_csv('data.csv')

print(df)

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
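
`pd.read_csv()` does not strictly need a file on disk. As a small sketch, the example below reads CSV text straight from memory by wrapping it in `io.StringIO`; the shortened two-row text and the name `df_from_text` are only for this illustration.

```python
import io

import pandas as pd

# A shortened copy of the CSV content, kept in memory as a string
csv_text = "Duration,Pulse,Maxpulse,Calories\n60,110,130,409.1\n45,109,175,282.4\n"

# StringIO makes the string behave like an open file
df_from_text = pd.read_csv(io.StringIO(csv_text))

print(df_from_text)
print(df_from_text.dtypes)   # Pandas infers a data type for each column
```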


### `to_string()`
If you want to print the entire DataFrame (useful for smaller datasets), use `to_string()`.

In [29]:
print(df.to_string())

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0


### Max Rows Configuration
If a DataFrame has more rows than the display limit (60 by default), printing the DataFrame shows only the first 5 and last 5 rows.

You can check this limit with `pd.options.display.max_rows`.
You can change this limit with `pd.options.display.max_rows = 9999`.

In [None]:
print("Max rows display limit:", pd.options.display.max_rows)
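
If you would rather not change the limit for the whole notebook, `pd.option_context` changes it only inside a `with` block. The sketch below uses a made-up 100-row DataFrame (`long_df`) and compares how many lines are printed under two limits.

```python
import pandas as pd

# A made-up DataFrame with 100 rows, just to trigger truncation
long_df = pd.DataFrame({"n": range(100)})

# Low limit: the printed text is shortened
with pd.option_context("display.max_rows", 10):
    truncated = repr(long_df)

# High limit: every row is printed
with pd.option_context("display.max_rows", 200):
    full = repr(long_df)

print(len(truncated.splitlines()), "lines when truncated")
print(len(full.splitlines()), "lines when shown in full")
```

Outside the `with` blocks, the original setting is restored automatically.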

## 6. Analyzing Data

Once we have data, we need to inspect it. Pandas provides great tools for a quick overview.

### `head()`
The `head()` method returns the headers and a specified number of rows, starting from the top.
* `df.head(10)`: Top 10 rows.
* `df.head()`: Top 5 rows (default).

In [30]:
df = pd.read_csv('data.csv')
#print("--- First 3 Rows ---")
print(df.head(3))

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0


### `tail()`
The `tail()` method returns the headers and the last rows.
* `df.tail()`: Last 5 rows (default).

In [31]:
print(df.tail(3))

   Duration  Pulse  Maxpulse  Calories
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
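
Our `data.csv` has only 6 rows, so the defaults are hard to see. This sketch uses a made-up 8-row DataFrame (`numbers_df`) to show the default of 5 rows and to confirm that `head()` and `tail()` leave the original DataFrame unchanged.

```python
import pandas as pd

numbers_df = pd.DataFrame({"n": range(8)})   # 8 rows, labels 0 to 7

print(len(numbers_df.head()))           # default: the first 5 rows
print(list(numbers_df.tail(2).index))   # labels of the last 2 rows
print(len(numbers_df))                  # the original still has all 8 rows
```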


### `info()`
The `info()` method gives you a concise summary of the DataFrame.
It tells you:
1.  The number of rows (RangeIndex).
2.  The number of columns.
3.  The name and data type (`Dtype`) of each column.
4.  The number of non-null values (identifying missing data).
5.  Memory usage.

This is crucial for deciding how to clean your data.
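
Our `data.csv` has no missing values, so here is a small sketch with a made-up DataFrame (`df_missing`) that does. It shows how the non-null count in `info()` relates to `count()` and `isna().sum()`.

```python
import pandas as pd

# Made-up data with one missing Calories value
df_missing = pd.DataFrame({
    "Pulse": [110, 117, 103],
    "Calories": [409.1, None, 340.0],
})

df_missing.info()                    # Calories shows "2 non-null" out of 3 entries

non_null = df_missing.count()        # non-null values per column
missing = df_missing.isna().sum()    # missing values per column

print(non_null)
print(missing)
```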

In [32]:
df.info()  # info() prints the summary itself and returns None, so print() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  6 non-null      int64  
 1   Pulse     6 non-null      int64  
 2   Maxpulse  6 non-null      int64  
 3   Calories  6 non-null      float64
dtypes: float64(1), int64(3)
memory usage: 324.0 bytes


--- 
### 🟢 Exercise 3: Analysis
1. Load `data.csv` into a dataframe variable called `fitness_df`.
2. Print the first 2 rows using `head()`.
3. Print the summary using `info()`.

In [34]:
# WRITE YOUR CODE HERE
fitness_df = pd.read_csv('data.csv')  # Use the variable name the exercise asks for
print(fitness_df.head(2))
fitness_df.info()  # info() prints the summary itself and returns None

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  6 non-null      int64  
 1   Pulse     6 non-null      int64  
 2   Maxpulse  6 non-null      int64  
 3   Calories  6 non-null      float64
dtypes: float64(1), int64(3)
memory usage: 324.0 bytes


## 7. Summary

Congratulations! You have learned the basics of Pandas.

**Recap of what we learned:**
1.  **Series**: A 1D column with an index. Create with `pd.Series()`.
2.  **DataFrames**: A 2D table. Create with `pd.DataFrame()`.
3.  **Indexing**: Use `loc` to access rows. Use named indexes for clearer code.
4.  **CSV**: Use `pd.read_csv()` to load data files.
5.  **Analyzing**: Use `head()`, `tail()`, and `info()` to understand your data before working on it.

Keep practicing by trying to load your own CSV files and inspecting them!