<a href="https://colab.research.google.com/github/BabakDavarmanesh/Learning_Pandas/blob/main/Learning_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Pandas Tutorial: A Beginner-Friendly Guide to Data Analysis**

**Author:** *Babak Davarmanesh*


---



**Creating a DataFrame with Pandas**

In this example, we create a simple DataFrame using pandas, a Python library for data manipulation and analysis.

First, we import pandas under the alias pd, which is the conventional short name for the library.

In [1]:
import pandas as pd

**Creating the DataFrame**

We create a DataFrame by passing a dictionary to the pd.DataFrame() constructor. Each key becomes a column name, and each list (or array) holds that column's values, so all lists must have the same length. In this case, we have three columns: Name, Age, and Sex.

In [23]:
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
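As a quick sanity check after building a DataFrame from a dictionary, it can help to look at its shape, column names, and column dtypes. The sketch below rebuilds the same data so that it runs on its own; the attributes used (`shape`, `columns`, `dtypes`) are standard DataFrame attributes.

```python
import pandas as pd

# Same data as the DataFrame created above.
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names, in the order of the dictionary keys
print(df.dtypes)         # one dtype per column
```

Each dictionary key shows up as one column, and the three list entries become three rows.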

**Displaying the DataFrame**

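After creating the DataFrame, we can display it in two ways: by passing it to print(), which produces plain text, or by writing the variable name as the last line of a notebook cell, which renders it as a table.

Using the print() function: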

In [24]:
print(df)

                       Name  Age     Sex
0   Braund, Mr. Owen Harris   22    male
1  Allen, Mr. William Henry   35    male
2  Bonnell, Miss. Elizabeth   58  female


Writing the variable name as the last line of a cell:

In [5]:
df

,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


**Accessing a Single Column**

To access a single column of the DataFrame, such as the Name column, put the column name in square brackets. The result is a Series, which holds that column's values together with the row index:

In [6]:
df["Name"]

,Name
0,"Braund, Mr. Owen Harris"
1,"Allen, Mr. William Henry"
2,"Bonnell, Miss. Elizabeth"
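To make the difference between a Series and a DataFrame concrete, the following sketch compares single brackets with a list of column names. The small DataFrame and the variable names `name_series` and `name_frame` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"],
        "Age": [22, 35],
    }
)

name_series = df["Name"]   # single brackets with one name -> Series
name_frame = df[["Name"]]  # a list of names -> DataFrame with one column

print(type(name_series).__name__)  # Series
print(type(name_frame).__name__)   # DataFrame
print(name_series.shape)           # (2,)
print(name_frame.shape)            # (2, 1)
```

The Series is one-dimensional, while the one-column DataFrame keeps two dimensions.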


**Creating and Working with a Pandas Series**

Sometimes, we don't need to create an entire DataFrame. Instead, we can create a pandas Series, which is essentially a single column of data. This can be useful when you only need to work with one set of values.

You can create a Series like this:

In [7]:
MyCol = pd.Series(["Canada", "USA", "Germany"], name="Nationality")
MyCol


,Nationality
0,Canada
1,USA
2,Germany
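A Series carries a name and an index in addition to its values. The short sketch below inspects both and then converts the Series into a one-column DataFrame with `to_frame()`; the variable name `nationality` is chosen for illustration only.

```python
import pandas as pd

nationality = pd.Series(["Canada", "USA", "Germany"], name="Nationality")

print(nationality.name)         # Nationality
print(list(nationality.index))  # [0, 1, 2] (default integer index)

# The Series name becomes the column name of the resulting DataFrame.
as_frame = nationality.to_frame()
print(list(as_frame.columns))   # ['Nationality']
```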


**Using Functions like max() on DataFrames and Series**

In pandas, both DataFrames and Series support a variety of functions that allow you to perform operations directly on the data.

For example, we can use the max() function to find the maximum value in a column of a DataFrame or in a Series.

**On DataFrame Column**

If you want to find the maximum value in a specific column of a DataFrame, you can access the column and apply the max() function like this:

In [8]:
df['Age'].max()


58

Here:

`df['Age']` accesses the Age column of the DataFrame df.

`.max()` then returns the maximum value from that column.

For example, since the Age column contains [22, 35, 58], calling `df['Age'].max()` returns 58.
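Often the next question is which row holds the maximum. One possible approach, sketched here, is to use `idxmax()`, which returns the index label of the largest value, and then look up that row with `.loc`. The variable names `oldest_label` and `oldest_name` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
    }
)

oldest_label = df["Age"].idxmax()           # index label of the row with the largest Age
oldest_name = df.loc[oldest_label, "Name"]  # look up another column in that row

print(oldest_label)  # 2
print(oldest_name)   # Bonnell, Miss. Elizabeth
```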

**On Series**

Similarly, you can call `max()` on a Series.

Because MyCol holds strings, the values are compared lexicographically (character by character), so the "maximum" is the string that sorts last: 'USA' comes after both 'Germany' and 'Canada'.


In [9]:
MyCol.max()

'USA'
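String comparison follows Python's own ordering of strings, which compares characters by their code points. One consequence, shown in this small sketch, is that uppercase letters sort before lowercase ones. The `mixed_case` Series is a made-up example.

```python
import pandas as pd

countries = pd.Series(["Canada", "USA", "Germany"])
print(countries.max())                    # USA
print(max(["Canada", "USA", "Germany"]))  # USA, same result as plain Python

# "B" has a smaller code point than "a", so "apple" sorts after "Banana".
mixed_case = pd.Series(["apple", "Banana"])
print(mixed_case.max())                   # apple

# Lowercasing first gives a case-insensitive comparison.
print(mixed_case.str.lower().max())       # banana
```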

The `.min()` method is the counterpart of `.max()` and returns the minimum value. Applied to a numeric column such as df['Age'], it returns the smallest number; applied to a Series of strings such as MyCol, it returns the string that sorts first.

In [25]:
df['Age'].min()

22

In [26]:
MyCol.min()

'Canada'
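When both extremes are needed, they can be computed in one call. The sketch below uses `agg()` with a list of method names, which returns a Series indexed by those names; `age_range` is an illustrative variable name.

```python
import pandas as pd

age = pd.Series([22, 35, 58], name="Age")

# agg() with a list of names applies each method and labels the results.
age_range = age.agg(["min", "max"])

print(age_range["min"])  # 22
print(age_range["max"])  # 58
```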



---



**Using df.describe() for Data Summary**

In pandas, the describe() method generates a statistical summary of a DataFrame. It is a quick way to check the range, center, and spread of each numerical column.

**What Does describe() Do?**
The describe() function calculates and displays the following statistics for all numerical columns in the DataFrame:

count: The number of non-null values.

mean: The average value.

std: The sample standard deviation (a measure of the spread of the data).

min: The smallest value.

25% (Q1): The first quartile (25th percentile).

50% (median or Q2): The middle value (50th percentile).

75% (Q3): The third quartile (75th percentile).

max: The largest value.


In [10]:
df.describe()

,Age
count,3.0
mean,38.333333
std,18.230012
min,22.0
25%,28.5
50%,35.0
75%,46.5
max,58.0
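Each row of the summary corresponds to a method that can also be called directly on the column. The following sketch recomputes the statistics one by one for the same three ages, which may help when only a single number is needed.

```python
import pandas as pd

age = pd.Series([22, 35, 58], name="Age")
summary = age.describe()

print(age.count())         # 3
print(age.mean())          # about 38.33
print(age.std())           # sample standard deviation, about 18.23
print(age.quantile(0.25))  # 28.5
print(age.median())        # 35.0
print(age.quantile(0.75))  # 46.5
```

The quartiles are interpolated between neighboring values, which is why 28.5 and 46.5 appear even though neither is in the data.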


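**describe() Summarizes Only Numerical Columns by Default**

By default, describe() on a DataFrame with mixed column types summarizes only the numerical columns. Columns containing non-numeric data (such as strings) are left out, which is why Name and Sex do not appear in the output above.

Missing values are also skipped: count reports only the non-null entries, and the other statistics are computed from those. Let's demonstrate this with an example where we temporarily replace a value in the Age column with a missing value and then revert it back.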

In [None]:
# Assign through df.loc; the chained form df['Age'].iloc[1] = ... may modify a temporary copy.
df.loc[1, 'Age'] = None

In [12]:
df

,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22.0,male
1,"Allen, Mr. William Henry",,male
2,"Bonnell, Miss. Elizabeth",58.0,female


In [13]:
df.describe()


,Age
count,2.0
mean,40.0
std,25.455844
min,22.0
25%,31.0
50%,40.0
75%,49.0
max,58.0
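The drop in count from 3 to 2 comes from the way pandas treats missing values. As a minimal sketch on a standalone Series (the name `ages` is illustrative), the difference between the length of the data and the number of non-null values looks like this:

```python
import pandas as pd

# None in numeric data is stored as NaN, so the Series holds floats.
ages = pd.Series([22, None, 58], name="Age")

print(len(ages))          # 3, all entries including the missing one
print(ages.count())       # 2, non-null entries only
print(ages.isna().sum())  # 1, number of missing entries
print(ages.mean())        # 40.0, computed from the two non-null values
```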


In [None]:
df.loc[1, 'Age'] = 35  # restore the original value

**Using describe() on Non-Numerical Columns or Series**

While describe() is mainly used for numerical columns, it also works on non-numerical (categorical or string) data. When applied to a Series of text values, it reports a different set of summary statistics. The example Series in the next cell contains a, b, and c twice each, plus one None.

*Explanation:*

count: Number of non-null values (None and NaN are ignored), here 6.

unique: Number of unique values in the Series (a, b, and c → 3 unique values).

top: The most frequent value. Here a, b, and c are tied at two occurrences each, so pandas reports only one of them (a in the output below).

freq: The frequency of the top value (2).

In [15]:
s = pd.Series(['a', 'b', 'c', 'a', 'b', None, 'c'])
s.describe()

,0
count,6
unique,3
top,a
freq,2
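The four statistics can also be derived by hand, which makes the tie between the values visible. This sketch uses `count()`, `nunique()`, and `value_counts()`; the variable names `counts` and `tied` are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a", "b", None, "c"])

counts = s.value_counts()  # occurrences per value; missing values are dropped by default

print(s.count())     # 6 non-null values
print(s.nunique())   # 3 distinct values
print(counts.max())  # 2, the highest frequency (freq)

# Every value that reaches the highest frequency; describe() reports only one of them as top.
tied = sorted(counts[counts == counts.max()].index)
print(tied)          # ['a', 'b', 'c']
```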


**DataFrame.describe(include=...)**

By default, DataFrame.describe() only includes numerical columns. However, we can select other data types using the include parameter:

`include="all"`

Includes all columns, regardless of their data type (numerical, categorical, etc.).

Non-numeric columns get statistics like count, unique, top (most frequent value), and frequency of the top value.

Numeric columns get count, mean, standard deviation, min, max, and quartiles.

`include="object"`

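Includes only object-dtype columns, which typically hold strings. (Columns with the dedicated category dtype are selected with include="category" instead.)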

Provides statistics like count, unique values, most frequent value (top), and its frequency (freq).

Does not include numerical statistics like mean or standard deviation.

`include="number"`

Includes only numerical columns.

Provides statistics such as count, mean, std (standard deviation), min, 25th percentile, median (50th percentile), 75th percentile, and max.

In [16]:
frame = pd.DataFrame(
    {
        "col1": ["a", "b", "c", "d"],
        "col2": [1, 2, 3, 6],
    }
)

In [17]:
frame

,col1,col2
0,a,1
1,b,2
2,c,3
3,d,6


In [20]:
frame.describe(include="all")

,col1,col2
count,4,4.0
unique,4,
top,a,
freq,1,
mean,,3.0
std,,2.160247
min,,1.0
25%,,1.75
50%,,2.5
75%,,3.75
max,,6.0


In [21]:
frame.describe(include="object")

,col1
count,4
unique,4
top,a
freq,1


In [22]:
frame.describe(include="number")

,col2
count,4.0
mean,3.0
std,2.160247
min,1.0
25%,1.75
50%,2.5
75%,3.75
max,6.0
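Besides include, describe() also accepts an exclude parameter, and the same dtype selectors work with `select_dtypes()` when the columns themselves are needed rather than their summary. The sketch below rebuilds the small frame from above so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "col1": ["a", "b", "c", "d"],
        "col2": [1, 2, 3, 6],
    }
)

numeric_summary = frame.describe(include="number")
non_numeric_summary = frame.describe(exclude="number")  # everything that is not numeric

print(numeric_summary.columns.tolist())      # ['col2']
print(non_numeric_summary.columns.tolist())  # ['col1']

# select_dtypes() returns the matching columns themselves.
numeric_columns = frame.select_dtypes(include="number")
print(numeric_columns.columns.tolist())      # ['col2']
```

Using exclude can be convenient when a frame has many numeric columns and only the remaining ones are of interest.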
