**Lab Seatwork 3 - Intro to Dataframe**

# DataFrame Basics in Python
This module provides a foundational understanding of DataFrames, a core data structure in the pandas library, essential for data manipulation and analysis in Python.

# What is a DataFrame?
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Think of it as:
1. A spreadsheet (like Excel)
1. A SQL table
1. A dictionary of Series objects

It's the most commonly used pandas object, and it allows you to store and manipulate data in a structured way.
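
The "dictionary of Series" view can be made concrete with a minimal sketch. The variable names and the two-row sample are illustrative assumptions, not part of the lab data file.

```python
import pandas as pd

# Two Series that share the same row labels (illustrative values)
names = pd.Series(['Alice', 'Bob'], index=[0, 1])
ages = pd.Series([25, 30], index=[0, 1])

# A DataFrame built from a dictionary of Series: the keys become column names
people = pd.DataFrame({'Name': names, 'Age': ages})

# Selecting a single column gives back a Series
age_column = people['Age']
print(type(age_column).__name__)  # Series
print(people.shape)               # (2, 2)
```

Each column of the DataFrame is a Series, and all columns share the same row index.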

# Key Characteristics of DataFrames
1. Labeled Axes: Both rows and columns have labels (indices for rows, column names for columns). This makes data access intuitive.
1. Heterogeneous Data: Columns can have different data types (e.g., one column can be integers, another strings, and another booleans).
1. Size Mutable: You can add or remove columns and rows.
1. Value Mutable: Data within the DataFrame can be changed.
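
The four characteristics above can be observed on a small example. This is only a sketch: the `demo` DataFrame and its `Remote` column are hypothetical and do not come from the lab data.

```python
import pandas as pd

demo = pd.DataFrame({
    'Name': ['Alice', 'Bob'],   # strings
    'Age': [25, 30],            # integers
    'Remote': [True, False],    # booleans (hypothetical column)
})

# Heterogeneous data: each column keeps its own dtype
print(demo.dtypes)

# Labeled axes: row labels (index) and column names
print(list(demo.index))
print(list(demo.columns))

# Size mutable: add a new column
demo['City'] = ['New York', 'Los Angeles']

# Value mutable: change a single cell by row label and column name
demo.loc[0, 'Age'] = 26
```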

# Creating DataFrames
DataFrames can be created in various ways, most commonly from:
1. Dictionaries of lists or Series
1. Lists of dictionaries
1. NumPy arrays
1. CSV or other external files

Let's look at some examples of how to create DataFrames.
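
A minimal sketch of the first three approaches follows, using small made-up values. The variable names are assumptions for illustration. The fourth approach, reading a CSV file with `pd.read_csv`, is used in the Viewing Data section.

```python
import numpy as np
import pandas as pd

# 1. From a dictionary of lists: keys become column names
from_dict = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# 2. From a list of dictionaries: each dictionary becomes one row
from_records = pd.DataFrame([
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
])

# 3. From a NumPy array: column names are supplied separately
from_array = pd.DataFrame(
    np.array([[25, 70000], [30, 85000]]),
    columns=['Age', 'Salary'],
)

print(from_dict)
print(from_records)
print(from_array)
```

The first two calls produce the same table; which one to use usually depends on whether the data arrives column by column or row by row.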

## Viewing Data

In [1]:
# Setup the Pandas Library
import pandas as pd

In [2]:
# Read the SalaryData CSV file and load it into a DataFrame
df = pd.read_csv('SalaryData.csv')
df

Unnamed: 0,Name,Age,City,Occupation,Salary
0,Alice,25,New York,Engineer,70000
1,Bob,30,Los Angeles,Artist,85000
2,Charlie,35,Chicago,Doctor,120000
3,David,28,New York,Engineer,75000
4,Eve,32,Houston,Scientist,90000
5,Frank,40,Miami,Manager,110000
6,Grace,22,Boston,Designer,60000


In [3]:
# Display the first 5 rows
df.head()

Unnamed: 0,Name,Age,City,Occupation,Salary
0,Alice,25,New York,Engineer,70000
1,Bob,30,Los Angeles,Artist,85000
2,Charlie,35,Chicago,Doctor,120000
3,David,28,New York,Engineer,75000
4,Eve,32,Houston,Scientist,90000


In [4]:
# Display the last 5 rows
df.tail()

Unnamed: 0,Name,Age,City,Occupation,Salary
2,Charlie,35,Chicago,Doctor,120000
3,David,28,New York,Engineer,75000
4,Eve,32,Houston,Scientist,90000
5,Frank,40,Miami,Manager,110000
6,Grace,22,Boston,Designer,60000


In [5]:
# Concise summary of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        7 non-null      object
 1   Age         7 non-null      int64 
 2   City        7 non-null      object
 3   Occupation  7 non-null      object
 4   Salary      7 non-null      int64 
dtypes: int64(2), object(3)
memory usage: 412.0+ bytes


In [6]:
# Generates descriptive statistics
df.describe()

Unnamed: 0,Age,Salary
count,7.0,7.0
mean,30.285714,87142.857143
std,6.074929,21574.89723
min,22.0,60000.0
25%,26.5,72500.0
50%,30.0,85000.0
75%,33.5,100000.0
max,40.0,120000.0


In [7]:
# Show the dimensions of df as (number of rows, number of columns)
df.shape

(7, 5)

## Selecting Columns

In [8]:
# Select one column: Name (returns a Series)
df['Name']

0      Alice
1        Bob
2    Charlie
3      David
4        Eve
5      Frank
6      Grace
Name: Name, dtype: object

In [9]:
# Display multiple columns (a list of names returns a DataFrame)
df[['Name', 'Age']]

Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30
2,Charlie,35
3,David,28
4,Eve,32
5,Frank,40
6,Grace,22
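
The difference between single and double brackets is worth a quick check. The sketch below uses a small stand-in DataFrame (an assumption, not the lab file) to show the type each form returns.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Single brackets with one name: a one-dimensional Series
one_series = people['Name']

# Double brackets (a list of names): a DataFrame, even with one column
one_frame = people[['Name']]

print(type(one_series).__name__)  # Series
print(type(one_frame).__name__)   # DataFrame
```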


## Adding and Modifying Columns

In [10]:
# Add new column: Experience
df['Experience'] = [2, 5, 10, 3, 7, 15, 1]
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,25,New York,Engineer,70000,2
1,Bob,30,Los Angeles,Artist,85000,5
2,Charlie,35,Chicago,Doctor,120000,10
3,David,28,New York,Engineer,75000,3
4,Eve,32,Houston,Scientist,90000,7
5,Frank,40,Miami,Manager,110000,15
6,Grace,22,Boston,Designer,60000,1


In [11]:
# Modify value of existing column: Increment age by 1 year
df['Age'] = df['Age'] + 1
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,26,New York,Engineer,70000,2
1,Bob,31,Los Angeles,Artist,85000,5
2,Charlie,36,Chicago,Doctor,120000,10
3,David,29,New York,Engineer,75000,3
4,Eve,33,Houston,Scientist,90000,7
5,Frank,41,Miami,Manager,110000,15
6,Grace,23,Boston,Designer,60000,1


In [12]:
# Add new column based on a condition: Seniority is 'Senior' where Age >= 30, otherwise 'Junior'
df['Seniority'] = ['Senior' if age >= 30 else 'Junior' for age in df['Age']]
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience,Seniority
0,Alice,26,New York,Engineer,70000,2,Junior
1,Bob,31,Los Angeles,Artist,85000,5,Senior
2,Charlie,36,Chicago,Doctor,120000,10,Senior
3,David,29,New York,Engineer,75000,3,Junior
4,Eve,33,Houston,Scientist,90000,7,Senior
5,Frank,41,Miami,Manager,110000,15,Senior
6,Grace,23,Boston,Designer,60000,1,Junior
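
The list comprehension above works well for small data. As a possible alternative, the same condition can be applied to the whole column at once with `numpy.where`. This is a sketch on a stand-in DataFrame that holds only the incremented ages shown above; it is not the approach used in the lab.

```python
import numpy as np
import pandas as pd

# Stand-in data: the Age values after the one-year increment
ages = pd.DataFrame({'Age': [26, 31, 36, 29, 33, 41, 23]})

# np.where(condition, value_if_true, value_if_false) evaluates the whole column at once
ages['Seniority'] = np.where(ages['Age'] >= 30, 'Senior', 'Junior')
print(ages)
```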


## Deleting Columns

In [13]:
# Use drop() to delete a column
# Create a DataFrame df_no_seniority where Seniority is no longer included
# drop() returns a new DataFrame, so df itself still has the Seniority column
df_no_seniority = df.drop('Seniority', axis=1)
df_no_seniority

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,26,New York,Engineer,70000,2
1,Bob,31,Los Angeles,Artist,85000,5
2,Charlie,36,Chicago,Doctor,120000,10
3,David,29,New York,Engineer,75000,3
4,Eve,33,Houston,Scientist,90000,7
5,Frank,41,Miami,Manager,110000,15
6,Grace,23,Boston,Designer,60000,1


In [14]:
# Delete a column in place using del (df itself is modified)
del df['Seniority']
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,26,New York,Engineer,70000,2
1,Bob,31,Los Angeles,Artist,85000,5
2,Charlie,36,Chicago,Doctor,120000,10
3,David,29,New York,Engineer,75000,3
4,Eve,33,Houston,Scientist,90000,7
5,Frank,41,Miami,Manager,110000,15
6,Grace,23,Boston,Designer,60000,1
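
The two deletion approaches differ in whether the original DataFrame changes. A minimal sketch on a stand-in DataFrame (`small` and `trimmed` are illustrative names) makes the difference visible.

```python
import pandas as pd

small = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [26, 31],
    'Seniority': ['Junior', 'Senior'],
})

# drop() returns a new DataFrame; the original keeps the column
trimmed = small.drop(columns='Seniority')
print('Seniority' in small.columns)    # True
print('Seniority' in trimmed.columns)  # False

# del removes the column from the original DataFrame in place
del small['Seniority']
print('Seniority' in small.columns)    # False
```

Here `drop(columns='Seniority')` is equivalent to `drop('Seniority', axis=1)`.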


## Indexing and Selection

In [17]:
# Label-based indexing: select the row with label 1
df.loc[1]

Name                  Bob
Age                    31
City          Los Angeles
Occupation         Artist
Salary              85000
Experience              5
Name: 1, dtype: object

In [16]:
# Label-based selection of multiple rows
df.loc[[1,3]]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
1,Bob,31,Los Angeles,Artist,85000,5
3,David,29,New York,Engineer,75000,3


In [18]:
# Using a label range (with loc, the end label 3 is included)
df.loc[1:3]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
1,Bob,31,Los Angeles,Artist,85000,5
2,Charlie,36,Chicago,Doctor,120000,10
3,David,29,New York,Engineer,75000,3


In [19]:
# Display rows from label 2 onward, with specific columns
df.loc[2:, ['Name', 'Salary']]

Unnamed: 0,Name,Salary
2,Charlie,120000
3,David,75000
4,Eve,90000
5,Frank,110000
6,Grace,60000


In [20]:
# Display specific rows and specific columns
df.loc[2:5, ['Name', 'Age', 'Salary']]

Unnamed: 0,Name,Age,Salary
2,Charlie,36,120000
3,David,29,75000
4,Eve,33,90000
5,Frank,41,110000


In [21]:
# Update values using labels
# Update salary of row 0 from 70,000 to 72,000
df.loc[0, 'Salary'] = 72000
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,26,New York,Engineer,72000,2
1,Bob,31,Los Angeles,Artist,85000,5
2,Charlie,36,Chicago,Doctor,120000,10
3,David,29,New York,Engineer,75000,3
4,Eve,33,Houston,Scientist,90000,7
5,Frank,41,Miami,Manager,110000,15
6,Grace,23,Boston,Designer,60000,1


In [22]:
# Using integer-based location (iloc)
# Display the rows at integer positions 3 and 4 (the end position 5 is excluded)
df.iloc[3:5]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
3,David,29,New York,Engineer,75000,3
4,Eve,33,Houston,Scientist,90000,7


In [23]:
# Integer-based selection of specific rows and specific columns
# Display the 2nd and 3rd rows (positions 1 and 2) and columns 0 to 4
df.iloc[1:3, 0:5]

Unnamed: 0,Name,Age,City,Occupation,Salary
1,Bob,31,Los Angeles,Artist,85000
2,Charlie,36,Chicago,Doctor,120000
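
The `loc` and `iloc` slices above differ in one detail that is easy to miss: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position. The sketch below uses small stand-in DataFrames (illustrative names and values) to show both behaviors.

```python
import pandas as pd

# String labels make the label/position distinction explicit
letters = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = letters.loc['b':'c']   # rows labeled 'b' and 'c' (end label included)
by_position = letters.iloc[1:3]   # positions 1 and 2 (end position excluded)
print(by_label.equals(by_position))  # True

# With the default 0..n-1 index, the same slice bounds give different row counts
numbers = pd.DataFrame({'Value': [10, 20, 30, 40]})
print(len(numbers.loc[1:3]))   # 3
print(len(numbers.iloc[1:3]))  # 2
```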


In [24]:
# Using integer-based indexing, update the salary of the first row (row 0, column 4) to 730,000
df.iloc[0,4] = 730000
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,26,New York,Engineer,730000,2
1,Bob,31,Los Angeles,Artist,85000,5
2,Charlie,36,Chicago,Doctor,120000,10
3,David,29,New York,Engineer,75000,3
4,Eve,33,Houston,Scientist,90000,7
5,Frank,41,Miami,Manager,110000,15
6,Grace,23,Boston,Designer,60000,1


## Filtering Data

In [25]:
# Filter rows where Age is greater than or equal to 30
df[df['Age'] >= 30]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
1,Bob,31,Los Angeles,Artist,85000,5
2,Charlie,36,Chicago,Doctor,120000,10
4,Eve,33,Houston,Scientist,90000,7
5,Frank,41,Miami,Manager,110000,15


In [26]:
# Filter rows where Occupation is 'Engineer' and city is 'New York'
df[(df['Occupation'] == 'Engineer') & (df['City'] == 'New York')]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,26,New York,Engineer,730000,2
3,David,29,New York,Engineer,75000,3
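
The filter above combines conditions with `&` (AND). The same pattern extends to `|` (OR) and `~` (NOT), and each condition needs its own parentheses. A minimal sketch on a stand-in DataFrame (a subset of the lab columns, used here as an assumption):

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
    'Occupation': ['Engineer', 'Artist', 'Doctor', 'Engineer'],
})

# OR: rows where Occupation is 'Doctor' or City is 'New York'
either = staff[(staff['Occupation'] == 'Doctor') | (staff['City'] == 'New York')]

# NOT: rows where City is anything other than 'New York'
not_ny = staff[~(staff['City'] == 'New York')]

print(either)
print(not_ny)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the operators `&`, `|`, and `~` are used.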


In [27]:
# Filter rows where Name is exactly 'Alice' or 'Bob'
df[df['Name'].isin(['Bob', 'Alice'])]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,26,New York,Engineer,730000,2
1,Bob,31,Los Angeles,Artist,85000,5


## Handling Missing Values