## Understanding Pandas DataFrame

Welcome to the notebook "Understanding Pandas DataFrame"! This notebook explores the data manipulation and analysis features of Pandas, a popular Python package for working with structured data.

For processing structured data such as CSV files, Excel spreadsheets, and SQL tables, Pandas offers the DataFrame, an effective and easy-to-use data structure. A DataFrame is a two-dimensional labeled data structure whose columns can hold different types, which makes it well suited to data exploration, cleaning, transformation, and analysis.

The main topics are:

1. Creating and Accessing DataFrames: Discover how to create DataFrames from various data sources, including CSV files, Excel spreadsheets, and Python dictionaries. We will also look at several ways of accessing and retrieving data from DataFrames.
2. Data Manipulation: Learn different methods for transforming and manipulating DataFrame data, such as filtering rows, selecting columns, adding or removing columns, handling missing values, and grouping data for aggregation.
3. Data Cleaning: Learn how to clean and preprocess DataFrames, including dealing with missing values, removing duplicates, managing outliers, and converting data types.
4. Data Analysis: Explore the analytical capabilities of DataFrames: descriptive statistics, data visualization, handling of dates and times, and complex data transformations.
5. Data Integration: Learn how to combine data from multiple sources by merging, joining, and concatenating DataFrames based on shared keys or indices.
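
To make the description of a DataFrame as a two-dimensional labeled structure with columns of different types more concrete, here is a minimal sketch. The column names and values are invented purely for illustration and do not come from the datasets used later.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type
sample = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cy'],
    'age': [34, 29, 41],
    'salary': [52000.0, 48500.5, 61000.0],
    'active': [True, False, True],
})

# Two dimensions: (rows, columns)
print(sample.shape)  # (3, 4)

# Each column carries its own dtype
print(sample.dtypes)
```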


By the end of this notebook, you will have a firm grasp of the fundamental ideas behind the Pandas DataFrame and be equipped to carry out a variety of data manipulation, cleaning, and analysis tasks.
Let's begin and put the Pandas DataFrame to work on your data analysis!

In [1]:
# Import the needed library
import pandas as pd

In [2]:
# Create an empty DataFrame
df = pd.DataFrame()

In [3]:
# View the first 5 rows of the DataFrame
df.head()

In [4]:
# View the last 5 rows of the DataFrame
df.tail()
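
As a small illustration of what the cells above return, the sketch below checks that a freshly created DataFrame is empty and shows that `head()` and `tail()` accept the number of rows to return. The `numbers` frame is a hypothetical example.

```python
import pandas as pd

empty_df = pd.DataFrame()

# An empty DataFrame has no rows and no columns
print(empty_df.shape)  # (0, 0)
print(empty_df.empty)  # True

# head() and tail() default to 5 rows but accept another count
numbers = pd.DataFrame({'n': range(10)})  # hypothetical data
print(numbers.head(3))
print(numbers.tail(2))
```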

## Creating a DataFrame using a Dictionary

In [6]:
# Create a dictionary of lists
data = {
    'FirstName': ['Guy', 'Kevin', 'Roberto', 'Rob', 'Thierry', 'David', 'JoLynn', 'Ruth', 'Gail', 'Barry'],
    'Department': ['Production', 'Marketing', 'Engineering', 'Tool', 'Design', 'Production Control',
                   'Information Services', 'Human Resources', 'Information Services', 'Human Resources'],
}

# Convert the dictionary to a DataFrame
df = pd.DataFrame(data)

# View the DataFrame
df

Unnamed: 0,FirstName,Department
0,Guy,Production
1,Kevin,Marketing
2,Roberto,Engineering
3,Rob,Tool
4,Thierry,Design
5,David,Production Control
6,JoLynn,Information Services
7,Ruth,Human Resources
8,Gail,Information Services
9,Barry,Human Resources
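
One detail worth knowing when building a DataFrame from a dictionary of lists: the keys become the column names, and all lists must have the same length. The following sketch uses a shortened version of the data above to show both points.

```python
import pandas as pd

# Lists of equal length: the dictionary keys become column names
ok = pd.DataFrame({'FirstName': ['Guy', 'Kevin'],
                   'Department': ['Production', 'Marketing']})
print(list(ok.columns))  # ['FirstName', 'Department']

# Lists of unequal length are rejected with a ValueError
raised = False
try:
    pd.DataFrame({'FirstName': ['Guy', 'Kevin'],
                  'Department': ['Production']})
except ValueError as err:
    raised = True
    print('ValueError:', err)
```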


## Creating a DataFrame using a List of Tuples

In [7]:
# Create a list of tuples
data2 = [('Gilbert','1/28/2006'),('Brown','8/26/2006'),('Tamburello','6/11/2007')]

# View the data
data2

[('Gilbert', '1/28/2006'), ('Brown', '8/26/2006'), ('Tamburello', '6/11/2007')]

In [8]:
# Convert the data to a DataFrame
df = pd.DataFrame(data2, columns=['Name', 'Hired Date'])

# View the DataFrame
df.head()

Unnamed: 0,Name,Hired Date
0,Gilbert,1/28/2006
1,Brown,8/26/2006
2,Tamburello,6/11/2007
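
The `Hired Date` values above are still plain strings. If they are needed as real dates, one possible approach is `pd.to_datetime`. The sketch below assumes the strings are in month/day/year order, which matches the sample values; the variable name `hired` is only illustrative.

```python
import pandas as pd

data2 = [('Gilbert', '1/28/2006'), ('Brown', '8/26/2006'), ('Tamburello', '6/11/2007')]
hired = pd.DataFrame(data2, columns=['Name', 'Hired Date'])

# Parse the strings as month/day/year (assumed format)
hired['Hired Date'] = pd.to_datetime(hired['Hired Date'], format='%m/%d/%Y')

print(hired.dtypes)
print(hired['Hired Date'].dt.year.tolist())  # [2006, 2006, 2007]
```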


## Importing Data from an Excel Source

In [9]:
# Import data from an Excel file
filePath = 'data/people.xlsx'
excel_df = pd.read_excel(filePath)
# Get the first 5 rows of the DataFrame
excel_df.head()

Unnamed: 0,BusinessEntityID,FirstName,MiddleName,LastName
0,285,Syed,E,Abbas
1,293,Catherine,R.,Abel
2,38,Kim,B,Abercrombie
3,211,Hazem,E,Abolrous
4,121,Pilar,G,Ackerman


In [10]:
# Get the last 5 rows of the DataFrame
excel_df.tail()

Unnamed: 0,BusinessEntityID,FirstName,MiddleName,LastName
16,5055,Eduardo,A,Albright
17,16858,Elijah,L,Albury
18,3889,Fernando,S,Alcorn
19,301,Frances,B.,Alderson
20,16850,Gabriel,S,Alexander


In [11]:
# Select a single column, which returns a Series
bus_series = excel_df['BusinessEntityID']

# Check the type (a notebook cell displays only its last expression)
type(bus_series)

pandas.core.series.Series
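
Selecting with a single column name returns a Series, while selecting with a list of names returns a DataFrame. The sketch below shows the difference on a small stand-in for `excel_df`, rebuilt from a few of the rows displayed earlier, since the Excel file itself is not available here.

```python
import pandas as pd

# Stand-in for excel_df, built from rows shown in the output above
people = pd.DataFrame({
    'BusinessEntityID': [285, 293, 38],
    'FirstName': ['Syed', 'Catherine', 'Kim'],
    'LastName': ['Abbas', 'Abel', 'Abercrombie'],
})

one_col = people['BusinessEntityID']          # single name -> Series
one_col_df = people[['BusinessEntityID']]     # list with one name -> DataFrame
two_cols = people[['FirstName', 'LastName']]  # list of names -> DataFrame

print(type(one_col).__name__)     # Series
print(type(one_col_df).__name__)  # DataFrame
print(two_cols.shape)             # (3, 2)
```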

In [12]:
# Check the shape of the DataFrame
excel_df.shape

(21, 4)

In [13]:
# Get the list of columns
excel_df.columns

Index(['BusinessEntityID', 'FirstName', 'MiddleName', 'LastName'], dtype='object')

In [14]:
# Check the data type of each column
excel_df.dtypes

BusinessEntityID     int64
FirstName           object
MiddleName          object
LastName            object
dtype: object
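
Once the column types are known, `select_dtypes` can split a DataFrame into numeric and non-numeric columns. This is a sketch on a stand-in frame built from rows shown earlier, not on the real `excel_df`.

```python
import pandas as pd

# Stand-in for excel_df, built from rows shown in the output above
people = pd.DataFrame({
    'BusinessEntityID': [285, 293, 38],
    'FirstName': ['Syed', 'Catherine', 'Kim'],
    'MiddleName': ['E', 'R.', 'B'],
    'LastName': ['Abbas', 'Abel', 'Abercrombie'],
})

# Keep only the numeric columns
numeric_cols = people.select_dtypes(include='number').columns.tolist()
print(numeric_cols)  # ['BusinessEntityID']

# Keep everything that is not numeric
text_cols = people.select_dtypes(exclude='number').columns.tolist()
print(text_cols)  # ['FirstName', 'MiddleName', 'LastName']
```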

## Importing Data from SQL Server

In [None]:
# Install pyodbc if it is missing: pip install pyodbc (or pin a version, e.g. pip install pyodbc==4.0.31)

In [15]:
import pyodbc

In [16]:
# Create the connection to SQL Server (raw string so the backslash in the server name is kept literally)
conn = pyodbc.connect(r'DRIVER={ODBC Driver 17 for SQL Server};SERVER=JONATHAN-POLLYN\SQLEXPRESS01;DATABASE=AdventureWorks2016;Trusted_Connection=yes;')
cursor=conn.cursor()

# Executing the query
query = "SELECT BusinessEntityID,JobTitle,BirthDate,Gender,HireDate FROM HumanResources.Employee"
sql_df = pd.read_sql(query,conn)

sql_df.head()



Unnamed: 0,BusinessEntityID,JobTitle,BirthDate,Gender,HireDate
0,1,Chief Executive Officer,1969-01-29,M,2009-01-14
1,2,Vice President of Engineering,1971-08-01,F,2008-01-31
2,3,Engineering Manager,1974-11-12,M,2007-11-11
3,4,Senior Tool Designer,1974-12-23,M,2007-12-05
4,5,Design Engineer,1952-09-27,F,2008-01-06
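
The cell above needs a running SQL Server instance. To try `pd.read_sql` without one, the sketch below uses an in-memory SQLite database as a stand-in. The `Employee` table, its three rows, and the `demo_df` name are assumptions made for this example, and the `?` placeholder style is specific to SQLite's driver.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for SQL Server (assumption)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Employee (BusinessEntityID INTEGER, JobTitle TEXT, Gender TEXT)')
conn.executemany(
    'INSERT INTO Employee VALUES (?, ?, ?)',
    [(1, 'Chief Executive Officer', 'M'),
     (2, 'Vice President of Engineering', 'F'),
     (5, 'Design Engineer', 'F')],
)
conn.commit()

# Parameterized query: the filter value is passed separately from the SQL text
query = ('SELECT BusinessEntityID, JobTitle FROM Employee '
         'WHERE Gender = ? ORDER BY BusinessEntityID')
demo_df = pd.read_sql(query, conn, params=('F',))
conn.close()

print(demo_df)
```

Passing values through `params` rather than formatting them into the query string helps avoid SQL injection.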


## Finding and Filtering Data in a DataFrame

In [17]:
# Extract the row with index label 2
sql_df.loc[2]

BusinessEntityID                      3
JobTitle            Engineering Manager
BirthDate                    1974-11-12
Gender                                M
HireDate                     2007-11-11
Name: 2, dtype: object

In [18]:
# Extract the row with index label 4
sql_df.loc[4]

BusinessEntityID                  5
JobTitle            Design Engineer
BirthDate                1952-09-27
Gender                            F
HireDate                 2008-01-06
Name: 4, dtype: object

In [19]:
# Extract the rows with index labels 1 through 4 (loc includes the end label)
sql_df.loc[1:4]

Unnamed: 0,BusinessEntityID,JobTitle,BirthDate,Gender,HireDate
1,2,Vice President of Engineering,1971-08-01,F,2008-01-31
2,3,Engineering Manager,1974-11-12,M,2007-11-11
3,4,Senior Tool Designer,1974-12-23,M,2007-12-05
4,5,Design Engineer,1952-09-27,F,2008-01-06


In [20]:
# Extract the first 2 rows of the data (iloc excludes the end position)
sql_df.iloc[0:2]

Unnamed: 0,BusinessEntityID,JobTitle,BirthDate,Gender,HireDate
0,1,Chief Executive Officer,1969-01-29,M,2009-01-14
1,2,Vice President of Engineering,1971-08-01,F,2008-01-31
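
With the default integer index, `loc` and `iloc` can look interchangeable. The difference becomes visible when index labels do not match row positions: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end. The frame below is a hypothetical example with made-up labels.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from row positions
emp = pd.DataFrame(
    {'JobTitle': ['Chief Executive Officer', 'Vice President of Engineering',
                  'Engineering Manager', 'Senior Tool Designer']},
    index=[10, 20, 30, 40],
)

by_label = emp.loc[20:30]    # labels 20 and 30, end label included
by_position = emp.iloc[1:3]  # positions 1 and 2, end position excluded
print(by_label.equals(by_position))  # True

first_two = emp.iloc[0:2]    # the first two rows, labelled 10 and 20
print(first_two.index.tolist())  # [10, 20]
```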


In [21]:
# Return records where the job title is Design Engineer
sql_df.loc[sql_df['JobTitle'] == 'Design Engineer']

Unnamed: 0,BusinessEntityID,JobTitle,BirthDate,Gender,HireDate
4,5,Design Engineer,1952-09-27,F,2008-01-06
5,6,Design Engineer,1959-03-11,M,2008-01-24
14,15,Design Engineer,1961-05-02,F,2011-01-18


In [22]:
# Alternatively, apply the boolean mask directly without loc
sql_df[sql_df['JobTitle']=='Design Engineer']

Unnamed: 0,BusinessEntityID,JobTitle,BirthDate,Gender,HireDate
4,5,Design Engineer,1952-09-27,F,2008-01-06
5,6,Design Engineer,1959-03-11,M,2008-01-24
14,15,Design Engineer,1961-05-02,F,2011-01-18
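
The same boolean-mask approach extends to several conditions. The sketch below rebuilds a few of the employee rows shown above in a stand-in frame, then combines two conditions with `&` and matches against a list of values with `isin`.

```python
import pandas as pd

# Stand-in for sql_df, using rows that appear in the outputs above
emp = pd.DataFrame({
    'BusinessEntityID': [4, 5, 6, 15],
    'JobTitle': ['Senior Tool Designer', 'Design Engineer',
                 'Design Engineer', 'Design Engineer'],
    'Gender': ['M', 'F', 'M', 'F'],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
female_engineers = emp[(emp['JobTitle'] == 'Design Engineer') & (emp['Gender'] == 'F')]
print(female_engineers['BusinessEntityID'].tolist())  # [5, 15]

# isin() keeps rows whose value matches any entry in the list
designers = emp[emp['JobTitle'].isin(['Senior Tool Designer', 'Tool Designer'])]
print(designers['BusinessEntityID'].tolist())  # [4]
```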


## Grouping and Joining DataFrames

In [24]:
# Group data by JobTitle and count the non-null values in each column
sql_df.groupby('JobTitle').count()

JobTitle,BusinessEntityID,BirthDate,Gender,HireDate
Accountant,2,2,2,2
Accounts Manager,1,1,1,1
Accounts Payable Specialist,2,2,2,2
Accounts Receivable Specialist,3,3,3,3
Application Specialist,4,4,4,4
...,...,...,...,...
Stocker,3,3,3,3
Tool Designer,2,2,2,2
Vice President of Engineering,1,1,1,1
Vice President of Production,1,1,1,1


In [25]:
# Group data by job title and count the non-null Gender values
sql_df.groupby('JobTitle')['Gender'].count()

JobTitle
Accountant                        2
Accounts Manager                  1
Accounts Payable Specialist       2
Accounts Receivable Specialist    3
Application Specialist            4
                                 ..
Stocker                           3
Tool Designer                     2
Vice President of Engineering     1
Vice President of Production      1
Vice President of Sales           1
Name: Gender, Length: 67, dtype: int64
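
Note that `count()` counts non-null values, so it can differ from the number of rows in a group when data is missing. `size()` counts rows regardless. The sketch below uses a hypothetical frame with missing `Gender` values to show the difference.

```python
import pandas as pd

# Hypothetical data with missing Gender values
emp = pd.DataFrame({
    'JobTitle': ['Accountant', 'Accountant', 'Stocker', 'Stocker', 'Stocker'],
    'Gender': ['F', None, 'M', 'M', None],
})

counts = emp.groupby('JobTitle')['Gender'].count()  # non-null values per group
sizes = emp.groupby('JobTitle').size()              # rows per group

print(counts.tolist())  # [1, 2]
print(sizes.tolist())   # [2, 3]
```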

In [26]:
# Group data by multiple columns
sql_df.groupby(['BirthDate','HireDate']).count()

BirthDate,HireDate,BusinessEntityID,JobTitle,Gender
1951-10-17,2011-01-04,1,1,1
1952-03-02,2010-02-05,1,1,1
1952-05-12,2010-01-23,1,1,1
1952-09-27,2008-01-06,1,1,1
1953-04-30,2010-01-22,1,1,1
...,...,...,...,...
1990-11-01,2009-01-27,1,1,1
1990-11-04,2009-01-17,1,1,1
1991-01-04,2009-01-10,1,1,1
1991-04-06,2009-02-15,1,1,1
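
Grouping by more than one column produces a result with a MultiIndex. If a flat table is easier to work with, `reset_index` turns the group keys back into ordinary columns. The sketch below groups a hypothetical frame by `JobTitle` and `Gender`; the IDs 100 and 101 and the `Employees` column name are made up for the example.

```python
import pandas as pd

# Hypothetical data for illustration
emp = pd.DataFrame({
    'JobTitle': ['Design Engineer', 'Design Engineer', 'Stocker',
                 'Design Engineer', 'Stocker'],
    'Gender': ['F', 'M', 'M', 'F', 'M'],
    'BusinessEntityID': [5, 6, 100, 15, 101],
})

# Grouping by two columns gives a two-level MultiIndex
grouped = emp.groupby(['JobTitle', 'Gender'])['BusinessEntityID'].count()
print(grouped.index.nlevels)  # 2

# reset_index() moves the group keys back into regular columns
flat = grouped.reset_index(name='Employees')
print(flat)
```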


In [31]:
# Join the Excel df with the SQL df - using inner join
inner_df = pd.merge(sql_df,excel_df, how='inner', on='BusinessEntityID')

In [34]:
inner_df.head()

Unnamed: 0,BusinessEntityID,JobTitle,BirthDate,Gender,HireDate,FirstName,MiddleName,LastName
0,38,Production Technician - WC60,1966-12-14,F,2010-01-16,Kim,B,Abercrombie
1,121,Shipping and Receiving Supervisor,1972-09-09,M,2009-01-02,Pilar,G,Ackerman
2,211,Quality Assurance Manager,1977-10-26,M,2009-02-28,Hazem,E,Abolrous
3,285,Pacific Sales Manager,1975-01-11,M,2013-03-14,Syed,E,Abbas


In [33]:
# Check the shape of the data
inner_df.shape

(4, 8)
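
An inner join keeps only the keys present in both DataFrames, which is why just 4 rows remain above. The sketch below reproduces the idea on two small stand-in frames built from rows shown in this notebook. The optional `validate` argument is an extra safety check and is not part of the original cell.

```python
import pandas as pd

# Stand-ins for sql_df and excel_df, using values that appear in the outputs
sql_part = pd.DataFrame({
    'BusinessEntityID': [1, 38, 121],
    'JobTitle': ['Chief Executive Officer', 'Production Technician - WC60',
                 'Shipping and Receiving Supervisor'],
})
excel_part = pd.DataFrame({
    'BusinessEntityID': [38, 121, 293],
    'FirstName': ['Kim', 'Pilar', 'Catherine'],
})

# validate='one_to_one' raises an error if either side has duplicate keys
inner = pd.merge(sql_part, excel_part, how='inner',
                 on='BusinessEntityID', validate='one_to_one')

# Only IDs 38 and 121 exist on both sides
print(inner['BusinessEntityID'].tolist())  # [38, 121]
```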

In [35]:
# Join the Excel df with the SQL df - using left join
left_df = pd.merge(sql_df,excel_df, how='left', on='BusinessEntityID')

In [36]:
# View the data
left_df.head()

Unnamed: 0,BusinessEntityID,JobTitle,BirthDate,Gender,HireDate,FirstName,MiddleName,LastName
0,1,Chief Executive Officer,1969-01-29,M,2009-01-14,,,
1,2,Vice President of Engineering,1971-08-01,F,2008-01-31,,,
2,3,Engineering Manager,1974-11-12,M,2007-11-11,,,
3,4,Senior Tool Designer,1974-12-23,M,2007-12-05,,,
4,5,Design Engineer,1952-09-27,F,2008-01-06,,,


In [37]:
# Check the shape of the data
left_df.shape

(290, 8)
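
A left join keeps every row of the left DataFrame and fills the right-hand columns with `NaN` where no match exists, so the row count equals that of `sql_df`. The sketch below counts the unmatched rows on small stand-in frames built from values shown in this notebook.

```python
import pandas as pd

# Stand-ins for sql_df and excel_df, using values that appear in the outputs
sql_part = pd.DataFrame({
    'BusinessEntityID': [1, 38, 121],
    'JobTitle': ['Chief Executive Officer', 'Production Technician - WC60',
                 'Shipping and Receiving Supervisor'],
})
excel_part = pd.DataFrame({
    'BusinessEntityID': [38, 121, 293],
    'FirstName': ['Kim', 'Pilar', 'Catherine'],
})

left = pd.merge(sql_part, excel_part, how='left', on='BusinessEntityID')

# Every left-hand row survives the join
print(len(left) == len(sql_part))  # True

# Rows without a match have NaN in the right-hand columns
unmatched = int(left['FirstName'].isna().sum())
print(unmatched)  # 1
```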

In [38]:
# Join the Excel df with the SQL df - using right join
right_df = pd.merge(sql_df,excel_df,how='right', on='BusinessEntityID')

In [39]:
# View the first 5 rows of the data
right_df.head()

Unnamed: 0,BusinessEntityID,JobTitle,BirthDate,Gender,HireDate,FirstName,MiddleName,LastName
0,285,Pacific Sales Manager,1975-01-11,M,2013-03-14,Syed,E,Abbas
1,293,,,,,Catherine,R.,Abel
2,38,Production Technician - WC60,1966-12-14,F,2010-01-16,Kim,B,Abercrombie
3,211,Quality Assurance Manager,1977-10-26,M,2009-02-28,Hazem,E,Abolrous
4,121,Shipping and Receiving Supervisor,1972-09-09,M,2009-01-02,Pilar,G,Ackerman


In [40]:
# View the last 5 rows of the data
right_df.tail()

Unnamed: 0,BusinessEntityID,JobTitle,BirthDate,Gender,HireDate,FirstName,MiddleName,LastName
16,5055,,,,,Eduardo,A,Albright
17,16858,,,,,Elijah,L,Albury
18,3889,,,,,Fernando,S,Alcorn
19,301,,,,,Frances,B.,Alderson
20,16850,,,,,Gabriel,S,Alexander


In [45]:
# Check the shape of the data
right_df.shape

(21, 8)

In [42]:
# Join the Excel df with the SQL df - using full join
full_df = pd.merge(sql_df, excel_df, how='outer', on='BusinessEntityID')

In [43]:
# View the first 5 rows of the data
full_df.head()

Unnamed: 0,BusinessEntityID,JobTitle,BirthDate,Gender,HireDate,FirstName,MiddleName,LastName
0,1,Chief Executive Officer,1969-01-29,M,2009-01-14,,,
1,2,Vice President of Engineering,1971-08-01,F,2008-01-31,,,
2,3,Engineering Manager,1974-11-12,M,2007-11-11,,,
3,4,Senior Tool Designer,1974-12-23,M,2007-12-05,,,
4,5,Design Engineer,1952-09-27,F,2008-01-06,,,


In [44]:
# Check the shape of the data
full_df.shape

(307, 8)
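
The four shapes are consistent with each other: when the join key is unique on both sides, the outer join has as many rows as the left and right frames combined, minus the matched rows counted twice, here 290 + 21 - 4 = 307. The sketch below checks the same relationship on small stand-in frames and uses `indicator=True` to label where each row came from.

```python
import pandas as pd

# Stand-ins for sql_df and excel_df, using values that appear in the outputs
sql_part = pd.DataFrame({
    'BusinessEntityID': [1, 38, 121],
    'JobTitle': ['Chief Executive Officer', 'Production Technician - WC60',
                 'Shipping and Receiving Supervisor'],
})
excel_part = pd.DataFrame({
    'BusinessEntityID': [38, 121, 293],
    'FirstName': ['Kim', 'Pilar', 'Catherine'],
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
outer = pd.merge(sql_part, excel_part, how='outer',
                 on='BusinessEntityID', indicator=True)
print(outer[['BusinessEntityID', '_merge']])

# With unique keys: outer rows = left rows + right rows - matched rows
matched = int((outer['_merge'] == 'both').sum())
print(len(outer) == len(sql_part) + len(excel_part) - matched)  # True
```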

THE END