## Pandas Introduction

## Task 1: Pandas Basics

### Creating a Series

A Pandas Series is a one-dimensional labeled array.

In [8]:
import pandas as pd

# Create a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data, name='Numbers')
print(series)

0    10
1    20
2    30
3    40
4    50
Name: Numbers, dtype: int64
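
A Series can also carry custom labels instead of the default 0..n-1 index. The following is a minimal sketch; the variable name `scores` and the labels `'a'`, `'b'`, `'c'` are illustrative assumptions, not part of the original exercise.

```python
import pandas as pd

# Hypothetical example: same kind of data as above, but with explicit labels
scores = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='Numbers')

# Label-based lookup uses the index label rather than the position
print(scores['b'])

# Arithmetic is applied element-wise and the labels are preserved
doubled = scores * 2
print(doubled)
```

The labels travel with the values, which is what distinguishes a Series from a plain Python list.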


### Creating a DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns can each hold a different data type.

In [18]:
# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
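
A dictionary of columns is not the only possible input. As a rough sketch, a DataFrame can also be built from a list of row dictionaries; the `records` and `people` names below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical records: each dictionary becomes one row
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
people = pd.DataFrame(records)

# The dictionary keys become the column labels
print(people)
```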


### Inspecting Data Structures

#### .head()
Shows the first n rows (default is 5).
#### .tail()
Shows the last n rows (default is 5).

In [178]:
print(df.head(2))  # First 2 rows
print(df.tail(1))  # Last 1 row

    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
    Name  Age     City
3  David   40  Houston


#### .shape
Returns a tuple of the form (number of rows, number of columns).

In [29]:
print(df.shape) 

(4, 3)


#### .columns
Returns the column labels.

In [33]:
print(df.columns)

Index(['Name', 'Age', 'City'], dtype='object')


#### .index
Returns the index (row labels).

In [36]:
print(df.index)

RangeIndex(start=0, stop=4, step=1)


#### .dtypes
Returns the data type of each column.

In [40]:
print(df.dtypes)

Name    object
Age      int64
City    object
dtype: object
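
When a column arrives with the wrong type (for example, numbers stored as text), `.astype()` can convert it. This is a small sketch on made-up data; the `people` frame and its string-valued `Age` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame where Age was read in as text
people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': ['25', '30']})

# Convert the text column to 64-bit integers
people['Age'] = people['Age'].astype('int64')

print(people.dtypes)
print(people['Age'].sum())
```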


## Task 2: Load and View Data - Iris Dataset

### Loading the Data

In [46]:
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset from scikit-learn
iris = load_iris()

# Convert it to a pandas DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add the target (species) column
df['species'] = iris.target

# Map target numbers to species names for better readability
df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print("Data loaded successfully!")

Data loaded successfully!


### Exploring the Data

#### 1. Using .info()
This gives a concise summary of the DataFrame, including each column's data type and non-null count.

In [50]:
print("\nDataFrame Info:")
df.info()  # info() prints its summary directly and returns None, so no print() is needed


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


#### 2. Using .describe()
This provides descriptive statistics for numerical columns.

In [63]:
print("\nDescriptive Statistics:")
print(df.describe())


Descriptive Statistics:
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000   
mean            5.843333          3.057333           3.758000   
std             0.828066          0.435866           1.765298   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
50%             5.800000          3.000000           4.350000   
75%             6.400000          3.300000           5.100000   
max             7.900000          4.400000           6.900000   

       petal width (cm)  
count        150.000000  
mean           1.199333  
std            0.762238  
min            0.100000  
25%            0.300000  
50%            1.300000  
75%            1.800000  
max            2.500000  
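
By default `.describe()` on a mixed DataFrame reports only the numerical columns, which is why `species` is absent above. Calling it on a text column directly gives a different kind of summary. The sketch below uses a tiny hand-made frame (`flowers`, an assumption for illustration) rather than the full dataset.

```python
import pandas as pd

# Hypothetical three-row sample with one text column
flowers = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica'],
    'petal length (cm)': [1.4, 1.3, 6.0],
})

# For text data, describe() reports count, unique, top (most frequent) and freq
summary = flowers['species'].describe()
print(summary)
```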


#### 3. Checking for Null Values with .isnull().sum()

In [67]:
print("\nNull Value Count:")
print(df.isnull().sum())


Null Value Count:
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
species              0
dtype: int64
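
The Iris data happens to be complete, so the check above returns zeros. To show what the same check looks like when values are missing, here is a sketch on a made-up frame; `sample` and its values are assumptions, and dropping or mean-filling are only two of several possible strategies.

```python
import pandas as pd

# Hypothetical frame with one missing value in each column
sample = pd.DataFrame({
    'length': [5.1, float('nan'), 4.7],
    'width': [3.5, 3.0, float('nan')],
})

# Count missing values per column
null_counts = sample.isnull().sum()
print(null_counts)

# Option 1: keep only rows with no missing values
complete_rows = sample.dropna()

# Option 2: replace each missing value with its column mean
filled = sample.fillna(sample.mean())
print(filled)
```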


### Key Findings

**1. Null Values:** There are no null values in this dataset (every column shows 0 nulls).

**2. Data Types:**
   - 4 float64 columns (all measurements in cm)  
   - 1 object column (species name)  

**3. Dimensions:**
   - 150 rows (observations)
   - 5 columns (4 features + 1 target variable)

### Viewing the First Few Rows

In [121]:
print("\nFirst 5 rows:")
print(df.head())


First 5 rows:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  
0  setosa  
1  setosa  
2  setosa  
3  setosa  
4  setosa  


## Task 3: Access & Filter Data

### Accessing Data

#### 1. Accessing Columns

In [127]:
# Access a single column
sepal_length = df['sepal length (cm)']
print("\nSepal Length column:")
print(sepal_length.head())

# Access multiple columns
subset = df[['sepal length (cm)', 'petal length (cm)', 'species']]
print("\nMultiple columns:")
print(subset.head())


Sepal Length column:
0    5.1
1    4.9
2    4.7
3    4.6
4    5.0
Name: sepal length (cm), dtype: float64

Multiple columns:
   sepal length (cm)  petal length (cm) species
0                5.1                1.4  setosa
1                4.9                1.4  setosa
2                4.7                1.3  setosa
3                4.6                1.5  setosa
4                5.0                1.4  setosa


#### 2. Using .loc[] (label-based indexing)

In [130]:
# Access specific rows and columns by label
print("\n.loc[] examples:")
print(df.loc[0:2, ['sepal width (cm)', 'species']])  # Row labels 0 through 2 (end label included), specific columns
print(df.loc[df['species'] == 'versicolor'])  # All rows where species is versicolor


.loc[] examples:
   sepal width (cm) species
0               3.5  setosa
1               3.0  setosa
2               3.2  setosa
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
50                7.0               3.2                4.7               1.4   
51                6.4               3.2                4.5               1.5   
52                6.9               3.1                4.9               1.5   
53                5.5               2.3                4.0               1.3   
54                6.5               2.8                4.6               1.5   
55                5.7               2.8                4.5               1.3   
56                6.3               3.3                4.7               1.6   
57                4.9               2.4                3.3               1.0   
58                6.6               2.9                4.6               1.3   
59                5.2               2.7                3.9
[output truncated]

#### 3. Using .iloc[] (position-based indexing)

In [133]:
# Access specific rows and columns by position
print("\n.iloc[] examples:")
print(df.iloc[0:3, 1:4])  # Row positions 0-2, column positions 1-3 (end position excluded)
print(df.iloc[[0, 50, 100]])  # Specific rows (first of each species)


.iloc[] examples:
   sepal width (cm)  petal length (cm)  petal width (cm)
0               3.5                1.4               0.2
1               3.0                1.4               0.2
2               3.2                1.3               0.2
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  5.1               3.5                1.4               0.2   
50                 7.0               3.2                4.7               1.4   
100                6.3               3.3                6.0               2.5   

        species  
0        setosa  
50   versicolor  
100   virginica  
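
With the default RangeIndex, labels and positions coincide, so `.loc[]` and `.iloc[]` can look interchangeable. The sketch below uses a made-up frame (`df_demo`, an assumption for illustration) whose labels differ from its positions to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame whose row labels (10, 20, ...) differ from positions (0, 1, ...)
df_demo = pd.DataFrame({'value': [1, 2, 3, 4]}, index=[10, 20, 30, 40])

# .loc slices by label and includes the end label: rows 10, 20 and 30
by_label = df_demo.loc[10:30]

# .iloc slices by position and excludes the end position: the first two rows
by_position = df_demo.iloc[0:2]

print(by_label)
print(by_position)
```

Note that `df_demo.loc[0]` would raise a `KeyError` here, because no row carries the label 0.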


### Adding a New Column

In [136]:
# Add a new calculated column
df['sepal area (cm²)'] = df['sepal length (cm)'] * df['sepal width (cm)']
print("\nDataFrame with new sepal area column:")
print(df.head())


DataFrame with new sepal area column:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  sepal area (cm²)  
0  setosa             17.85  
1  setosa             14.70  
2  setosa             15.04  
3  setosa             14.26  
4  setosa             18.00  
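
Assigning to `df[...]` modifies the DataFrame in place. An alternative, sketched below on a two-row sample, is `.assign()`, which returns a new DataFrame and leaves the original untouched. The `flowers` and `with_area` names are assumptions for illustration; the dictionary unpacking is needed only because the column name contains spaces and symbols.

```python
import pandas as pd

# Hypothetical two-row sample using the same column names as the Iris frame
flowers = pd.DataFrame({
    'sepal length (cm)': [5.1, 4.9],
    'sepal width (cm)': [3.5, 3.0],
})

# assign() returns a copy with the extra column; `flowers` itself is unchanged
with_area = flowers.assign(
    **{'sepal area (cm²)': flowers['sepal length (cm)'] * flowers['sepal width (cm)']}
)

print(with_area)
print(flowers.columns.tolist())
```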


### Dropping Data

#### 1. Dropping Columns

In [140]:
# Drop the sepal area column we just created
df_dropped = df.drop('sepal area (cm²)', axis=1)
print("\nDataFrame after dropping sepal area column:")
print(df_dropped.head())

# Drop multiple columns
df_dropped = df.drop(['sepal length (cm)', 'sepal width (cm)'], axis=1)
print("\nDataFrame after dropping sepal measurements:")
print(df_dropped.head())


DataFrame after dropping sepal area column:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

  species  
0  setosa  
1  setosa  
2  setosa  
3  setosa  
4  setosa  

DataFrame after dropping sepal measurements:
   petal length (cm)  petal width (cm) species  sepal area (cm²)
0                1.4               0.2  setosa             17.85
1                1.4               0.2  setosa             14.70
2                1.3               0.2  setosa             15.04
3                1.5               0.2  setosa             14.26
4                1.4               0.2  setosa             18.00

#### 2. Dropping Rows

In [143]:
# Drop rows by index
df_dropped_rows = df.drop([0, 1, 2])  # Drop first three rows
print("\nDataFrame after dropping first three rows:")
print(df_dropped_rows.head())

# Drop rows based on condition
df_no_setosa = df[df['species'] != 'setosa']  # Alternative to drop
print("\nDataFrame without setosa species:")
print(df_no_setosa.head())


DataFrame after dropping first three rows:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   
5                5.4               3.9                1.7               0.4   
6                4.6               3.4                1.4               0.3   
7                5.0               3.4                1.5               0.2   

  species  sepal area (cm²)  
3  setosa             14.26  
4  setosa             18.00  
5  setosa             21.06  
6  setosa             15.64  
7  setosa             17.00  

DataFrame without setosa species:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
50                7.0               3.2                4.7               1.4   
51                6.4               3.2                4.5               1.5   
52                6.9
[output truncated]

These operations form the foundation for data manipulation in pandas, allowing you to select exactly the data you need for analysis.
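
Filters can also combine several conditions. The following sketch uses a four-row made-up sample (`flowers`, an assumption for illustration); each condition must be wrapped in parentheses, and `&` / `|` are used instead of Python's `and` / `or`.

```python
import pandas as pd

# Hypothetical four-row sample
flowers = pd.DataFrame({
    'sepal length (cm)': [5.1, 7.0, 6.3, 5.5],
    'species': ['setosa', 'versicolor', 'virginica', 'versicolor'],
})

# Both conditions must hold: versicolor AND sepal length above 6.0
long_versicolor = flowers[
    (flowers['species'] == 'versicolor') & (flowers['sepal length (cm)'] > 6.0)
]

# isin() keeps rows whose value matches any entry in the list
selected = flowers[flowers['species'].isin(['setosa', 'virginica'])]

print(long_versicolor)
print(selected)
```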

## Task 4: Built-in Methods

### 1. Sorting Values with .sort_values()

In [154]:
# Sort by sepal length in descending order
sorted_df = df.sort_values('sepal length (cm)', ascending=False)

# Show top 5 longest sepals
print("Top 5 flowers by sepal length:")
print(sorted_df.head())

Top 5 flowers by sepal length:
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
131                7.9               3.8                6.4               2.0   
135                7.7               3.0                6.1               2.3   
122                7.7               2.8                6.7               2.0   
117                7.7               3.8                6.7               2.2   
118                7.7               2.6                6.9               2.3   

       species  sepal area (cm²)  
131  virginica             30.02  
135  virginica             23.10  
122  virginica             21.56  
117  virginica             29.26  
118  virginica             20.02  
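
`.sort_values()` also accepts a list of columns, with a separate direction for each. This is a minimal sketch on made-up rows; the `flowers` and `ranked` names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical four-row sample
flowers = pd.DataFrame({
    'species': ['virginica', 'setosa', 'virginica', 'setosa'],
    'sepal length (cm)': [7.7, 5.1, 7.9, 4.9],
})

# Sort by species A-Z, then by sepal length from longest to shortest within each species
ranked = flowers.sort_values(
    ['species', 'sepal length (cm)'], ascending=[True, False]
).reset_index(drop=True)

print(ranked)
```

`reset_index(drop=True)` is optional; it renumbers the rows so the index reflects the new order.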


### 2. Counting Values with .value_counts()

In [157]:
# Count occurrences of each species
species_counts = df['species'].value_counts()

print("\nSpecies distribution:")
print(species_counts)


Species distribution:
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
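
When proportions are more useful than raw counts, `.value_counts()` takes a `normalize=True` argument. The sketch below uses an intentionally unbalanced made-up Series (`species`, an assumption for illustration) so the shares are not all equal.

```python
import pandas as pd

# Hypothetical unbalanced sample
species = pd.Series(['setosa', 'setosa', 'versicolor', 'virginica'], name='species')

# normalize=True returns each category's share of the total instead of its count
shares = species.value_counts(normalize=True)

print(shares)
```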


### 3. Finding Unique Values with .unique()

In [164]:
# Get unique species names
unique_species = df['species'].unique()

print("\nUnique species in dataset:")
print(unique_species)


Unique species in dataset:
['setosa' 'versicolor' 'virginica']


### 4. Basic Statistics Methods

#### Column-wise statistics:

In [168]:
# Calculate mean sepal length
mean_sepal_length = df['sepal length (cm)'].mean()

# Calculate minimum petal width
min_petal_width = df['petal width (cm)'].min()

# Calculate maximum sepal width
max_sepal_width = df['sepal width (cm)'].max()

# Sum of all petal lengths
total_petal_length = df['petal length (cm)'].sum()

print("\nBasic Statistics:")
print(f"Mean sepal length: {mean_sepal_length:.2f} cm")
print(f"Minimum petal width: {min_petal_width:.2f} cm") 
print(f"Maximum sepal width: {max_sepal_width:.2f} cm")
print(f"Total petal length: {total_petal_length:.2f} cm")


Basic Statistics:
Mean sepal length: 5.84 cm
Minimum petal width: 0.10 cm
Maximum sepal width: 4.40 cm
Total petal length: 563.70 cm


### Key Takeaways

**1. Sorting:**
   - .sort_values() - Reorder your data based on column values
   - Essential for ranking and identifying extremes

**2. Categorical Analysis:**
   - .value_counts() - Count occurrences of each category
   - .unique() - List all distinct categories

**3. Numerical Analysis:**
   - .mean()/.sum()/.min()/.max() - Basic descriptive statistics
   - Often used with .groupby() for segmented analysis
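
The segmented analysis mentioned in the last point can be sketched as follows. The four-row `flowers` frame is made up for illustration; the same pattern would apply to the full Iris DataFrame.

```python
import pandas as pd

# Hypothetical four-row sample
flowers = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'petal length (cm)': [1.4, 1.6, 6.0, 5.0],
})

# Split the rows by species, then compute the mean petal length within each group
mean_by_species = flowers.groupby('species')['petal length (cm)'].mean()

print(mean_by_species)
```

The result is a Series indexed by species, so individual group values can be read with, for example, `mean_by_species['setosa']`.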