Putting Some Pandas In Your Python 🐼
# Introduction to Pandas 🐼

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Reference: https://pandas.pydata.org/docs/getting_started/index.html

- Question: What are the Data Structures in Pandas?
- Answer: Series (similar to a 1-dimensional NumPy array) and DataFrame (similar to a 2-dimensional NumPy array)
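
As a rough illustration of the two structures named above, the following sketch builds one of each. The labels and values are made up for the example.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is a two-dimensional labeled table; each column can hold its own type.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 25]})

print(s.ndim, df.ndim)  # number of dimensions of each structure
print(s["b"])           # label-based access to a single element
print(df.shape)         # (rows, columns)
```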

Installation Command
! pip install pandas

Importing Pandas
import pandas as pd

### What's covered in this notebook?
1. Pandas Data Structure - Series (ndarray-like)
   - Creating Series using Python list or dict
   - Creating Series from Numpy ndarray
   - Creating Series from scalar
   - Accessing Properties/Attributes and Methods of Series
   - Accessing data using Indexing and Slicing
2. Pandas Data Structure - DataFrame
   - Creating DataFrame using Python dict, list or tuple
   - Creating DataFrame using Numpy Array
   - Accessing Attributes/Properties and Methods of DataFrame
3. Working with Tabular Data
   - DataFrame to .csv & .xlsx
   - Reading .xlsx File
   - Reading .csv File - Iris Dataset
4. Non-Visual Data Analysis using Pandas (Statistical Analysis)
   - sum()
   - min() and max()
   - mean(), median(), var() and std()
   - describe() to summarize the data
   - corr(), skew() and kurt()
   - count(), unique() and value_counts() for categorical column
   - DataFrame.agg()
5. Accessing Data in a DataFrame using Indexing and Slicing in Pandas DataFrame
   - Reading .csv File - Weather Dataset
   - Filtering Single Column vs Multiple Columns from a DataFrame
   - Filtering Rows from a DataFrame
   - Filtering specific rows and columns from a DataFrame
   - loc[] vs iloc[]
6. Renaming Columns, Modifying DataTypes, Creating New Columns and Deleting Columns in Pandas DataFrame
   - Reading .csv File - Retail Store Sales Data
   - Renaming Columns
   - Modifying Columns DataTypes
   - Creating a Derived Column
   - Creating columns using apply() function
   - Deleting column(s) in DataFrame
7. Adding/Inserting Row(s)
   - Reading .xlsx File - Weather Data
   - Insert Row(s) using pandas.concat()
   - Inserting a Row using List - .loc[] and .iloc[]
   - Inserting a Row at a Specific Index of a DataFrame
   - Saving DataFrame to .xlsx
8. Handling TimeSeries Data
   - Reading .csv File - Online Store Sales Data
   - pd.to_datetime()
   - Working with DateTime in Pandas
   - Creating a Column containing only the Order Month
   - Calculating Delivery Time from Order Date and Ship Date
   - pandas.Timedelta
   - Creating a Column containing Delivery Time in Number of Days
   - Improve Performance by Setting Date Column as the Index
   - Sorting Data Based on Index vs Values and Resetting Index
9. Summary

Q1. Import Pandas Module and numpy module

Q2. Create Series using Python list or dict

Q3. Create Series from Numpy ndarray

Q4. Create a Series from a scalar, with an index that starts from 1 or from 'a'
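
The four ways of building a Series asked for above might look like the sketch below. The variable names and values are illustrative assumptions, not part of the exercise.

```python
import numpy as np
import pandas as pd

from_list = pd.Series([10, 20, 30])                # default index 0, 1, 2
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})    # dict keys become the index
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))  # dtype follows the ndarray

# A scalar is repeated once for every label in the index.
from_scalar_num = pd.Series(7, index=range(1, 4))        # index 1, 2, 3
from_scalar_alpha = pd.Series(7, index=["a", "b", "c"])  # index a, b, c

print(from_scalar_num)
print(from_scalar_alpha)
```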

Accessing Properties/Attributes and Methods of Series

Q1. Create array of numbers, check for the data type, shape, values and the length

Q2. Convert s = pd.Series([1,2,3,4,5,6,7,8,9]) to a NumPy array using to_numpy()

Q3. Inspect the Series using head(), tail() and info()

Accessing data using Indexing and Slicing

Q1. Access the 2nd and 4th elements in s = pd.Series([1, 2, 3, 4, 5]) using the indexing or slicing method

Q2. Create a pandas series with index=['a', 'b', 'c', 'd', 'e']

## Pandas Data Structure - DataFrame

A DataFrame is a general 2D labeled, value- and size-mutable tabular structure with potentially heterogeneously typed columns.

Important Note: Pandas data structures are value-mutable (the values they contain can be altered) as well as size-mutable.
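
A small sketch of what value and size mutability can look like on a DataFrame. The column names follow the Name, Age and Gender exercise and the values are made up.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [31, 25]})

# Value mutation: overwrite a single cell.
df.loc[0, "Age"] = 32

# Size mutation: add a new column, then remove it again.
df["Gender"] = ["F", "M"]
shape_with_gender = df.shape
df = df.drop(columns="Gender")

print(shape_with_gender, df.shape)
```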

Q1. Create a DataFrame using Python dict, list or tuple to print Name, Age and Gender

Q2. Create a DataFrame using tuples and assign it to a variable named data
       ('1/1/2019', 13, 6, 'Rain'),
       ('2/1/2019', 11, 7, 'Fog'),
       ('3/1/2019', 12, 8, 'Sunny'),
       ('4/1/2019', 8, 5, 'Snow'),
       ('5/1/2019', 9, 6, 'Rain')
Note: The minimum rows should be 50

Q3. Name the columns of the Q2 DataFrame: Day, Temperature, Windspeed and Event

Q4. Print the Temperature column

Creating DataFrame using Numpy Array

Q1. Create an array using random.randint; choose the range yourself, and make the size (1000,100)

Q2. Convert Q1 to a DataFrame 

Q3. Write a program to output col_1 to col_100 for the Dataframe in Q2
Hint: col_ + str(i) followed by a for loop
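
One possible reading of the hint, shown on a deliberately small array so the output stays readable. The range 0 to 99 and the shape (5, 4) are assumptions for this sketch; the exercise itself asks for (1000,100).

```python
import numpy as np
import pandas as pd

# Small random integer array; swap in size=(1000, 100) for the exercise.
arr = np.random.randint(0, 100, size=(5, 4))

# Build "col_1", "col_2", ... with one name per array column.
col_names = ["col_" + str(i) for i in range(1, arr.shape[1] + 1)]

df = pd.DataFrame(arr, columns=col_names)
print(df.columns.tolist())
```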

Accessing Attributes/Properties and Methods of DataFrame

Q1. Create a dictionary of Series and assign it to a variable named data

Q2. Print the shape, columns, dtypes, axes and values of Q1

Q3. Check for the dataframe information

Q4. Output the last 2 rows in the dataframe

### Working with Tabular Data

Remember
Getting data into pandas from many different file formats or data sources is supported by read_* functions.
Exporting data out of pandas is provided by different to_* methods.
The head/tail/info methods and the dtypes attribute are convenient for a first check.
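
The read_*/to_* pairing can be tried without touching the disk. The sketch below writes a made-up DataFrame to an in-memory CSV buffer and reads it back; a file path such as new_data.csv would work the same way.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# to_* methods export; index=False leaves the row index out of the file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# read_* functions import; rewind the buffer before reading it back.
buffer.seek(0)
restored = pd.read_csv(buffer)

# First checks on freshly loaded data.
print(restored.head(2))
print(restored.dtypes)
restored.info()
```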

Q1. Create an array using random.randint; choose the range yourself, and make the size (1000,100)

Q2. Write Q1 dataframe to csv with the name new_data.csv without index

Q3. Write the Q1 DataFrame to new_data.xlsx (specify a sheet_name) without the index

Q4. Read the data in temp/iris.csv. Use head(), tail(), info() and dtypes

The iris data set is widely used as a beginner's dataset for machine learning purposes.

##### Non-Visual Data Analysis using Pandas (Statistical Analysis)
groupby provides the power of the split-apply-combine pattern.
value_counts is a convenient shortcut to count the number of entries in each category of a variable.
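
A minimal sketch of both ideas on a tiny hand-made table. The Species and PetalWidthCm column names mirror the iris columns used later in this notebook, but the rows are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
    "PetalWidthCm": [0.2, 0.4, 1.8, 2.0, 2.2],
})

# value_counts: number of rows in each category.
counts = df["Species"].value_counts()

# groupby: split by Species, apply mean to each group, combine into one result.
means = df.groupby("Species")["PetalWidthCm"].mean()

print(counts)
print(means)
```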

Q5. print the columns in Q3

Q6. Sum each column, then output the min, max, mean, median, var and std

count(), nunique(), unique() and value_counts() for categorical column

Q7. Use the above for analysing the data in Q3

Q8. Describe the data using describe() with include=object

corr(), skew() and kurt()

df.agg(
    {
        "SepalLengthCm" : ["min", "max", "median", "count"],
        "PetalWidthCm" : ["min", "max", "mean", "count"],
        "Species" : ["count"]
    }
)
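
To see what the agg() call above returns without loading the iris file, here is the same call on four invented rows. Cells for functions that were not requested for a column come back as NaN.

```python
import pandas as pd

# Made-up rows that reuse the iris column names from the call above.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.3, 5.8],
    "PetalWidthCm": [0.2, 0.2, 1.8, 2.2],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# One row per aggregation name, one column per aggregated column.
summary = df.agg({
    "SepalLengthCm": ["min", "max", "median", "count"],
    "PetalWidthCm": ["min", "max", "mean", "count"],
    "Species": ["count"],
})
print(summary)
```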

##### Accessing Data in a DataFrame using Indexing and Slicing in Pandas DataFrame
You can assign new values to a selection based on loc/iloc.
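
A short sketch of assignment through loc and iloc, using invented weather-style values and hypothetical row labels d1 to d3.

```python
import pandas as pd

df = pd.DataFrame(
    {"Temperature": [38, 36, 40], "Humidity": [52, 46, 51]},
    index=["d1", "d2", "d3"],
)

# loc selects by label: row "d2", column "Temperature".
df.loc["d2", "Temperature"] = 99

# iloc selects by position: first row, second column.
df.iloc[0, 1] = 0

# loc also accepts a boolean mask, here capping Humidity at 50.
df.loc[df["Humidity"] > 50, "Humidity"] = 50

print(df)
```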

Reading .csv File - Weather Dataset

Q1. Read the weather Dataset(nyc_weather.csv)

Q2. Access the data using head(), tail(), info, columns and shape

Q3. Check for the min and max Temperature

Q4. Select multiple columns (Temperature, DewPoint and Humidity)

Q5. Filter rows 1 to 30 using slicing

Q6. Check which values in the Temperature column are greater than 30

Q7. Show the full DataFrame rows for Q6, that is, the rows where Temperature is greater than 30

Q8. Using isin, check which rows have '1/10/2016', '1/16/2016' or '1/2/2016' in the EST column
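
Boolean filtering and isin follow the same pattern: build a True/False mask, then index the DataFrame with it. The sketch below uses a few invented rows in place of nyc_weather.csv.

```python
import pandas as pd

# Invented stand-in rows; the real data comes from nyc_weather.csv.
df = pd.DataFrame({
    "EST": ["1/1/2016", "1/2/2016", "1/10/2016", "1/16/2016"],
    "Temperature": [38, 36, 32, 28],
})

# Comparison yields a boolean Series; indexing with it keeps the True rows.
warm = df[df["Temperature"] > 30]

# isin is True wherever the value appears in the given list.
mask = df["EST"].isin(["1/10/2016", "1/16/2016", "1/2/2016"])
matches = df[mask]

print(warm)
print(matches)
```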

Q9. Use loc and iloc to print row 30

Q10. Show EST and DewPoint for Q9

Reading .csv File - Retail Store Sales Data

Q1. Read retail_store_sales.xlsx file after importing the necessary module

Q2. What comes to your mind immediately after looking at the dataset?

Answer the following:
How many sales records do we have in the dataset?
How many customers do we have?
What is the date range of data?
Which country recorded maximum sales count?
What is the minimum order amount and maximum order amount?
How many orders for each customer?
What is the revenue contributed by each customer?
What is the revenue generated each year?
Which customer contributed to the maximum revenue each year and how much?
Are there more orders placed on weekends?
How many customers churned (i.e. Customers not making any purchases for more than or equal to 2 months)?

NB:
Keep in mind that, as a data analyst, you should first be able to ask the right questions. Answering them can then be done with the help of the pandas module.

Q3. Rename the columns using the code below

#col_names = [ col.strip().lower().replace(' ', '_').replace('-', '_') for col in df_renamed ]


['invoice_no', 'stock_code', 'product_description', 'quantity', 'invoice_date', 'unit_price', 'cust_id', 'country']
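
To show what the renaming expression does, the sketch below applies it to a few hypothetical raw headers. The untidy names are assumptions made for the example; the real file's original headers are not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw headers with stray spaces, capitals and hyphens.
df_renamed = pd.DataFrame(columns=["Invoice No", " Stock-Code", "Unit Price ", "Country"])

# Iterating over a DataFrame yields its column labels.
col_names = [
    col.strip().lower().replace(" ", "_").replace("-", "_")
    for col in df_renamed
]
df_renamed.columns = col_names

print(df_renamed.columns.tolist())
```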


Q4. Convert all columns to string type using astype(str)

Q5. Convert the quantity, unit_price and cust_id columns to suitable data types

Q6. Convert ship date and order date to datetime

Q7. Apply a function on the complete column at once
Example: df_renamed[['amount']].apply(np.mean)

Q8. Create a new column using apply() and lambda 
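
The two uses of apply() asked for above could be sketched as follows. The quantity, unit_price and amount names come from this notebook; the values and the is_large column are invented for illustration.

```python
import numpy as np
import pandas as pd

df_renamed = pd.DataFrame({"quantity": [2, 5, 1], "unit_price": [3.0, 1.5, 10.0]})

# Row-wise apply (axis=1): the lambda receives one row at a time.
df_renamed["amount"] = df_renamed.apply(
    lambda row: row["quantity"] * row["unit_price"], axis=1
)

# Column-wise apply: the function receives each selected column as a whole.
mean_amount = df_renamed[["amount"]].apply(np.mean)

# Element-wise apply on a single column to derive a new one.
df_renamed["is_large"] = df_renamed["amount"].apply(lambda x: x > 7)

print(df_renamed)
print(mean_amount)
```

For simple arithmetic such as quantity times unit_price, multiplying the two columns directly is usually faster than a row-wise apply; the lambda form is shown here because the exercise asks for it.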

Deleting column(s) in DataFrame

Q1. Delete the amount column using drop and axis=1

Q2. Use iloc to delete a column

Q3. Use pop to delete a column
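
The three approaches differ in whether they change the DataFrame in place. A sketch on invented data, with amount placed as the last column so the positional example can drop it:

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": [2, 5],
    "unit_price": [3.0, 1.5],
    "country": ["UK", "FR"],
    "amount": [6.0, 7.5],
})

# drop returns a new DataFrame without the column; df itself is unchanged.
dropped = df.drop("amount", axis=1)

# iloc does not delete anything; it selects the columns to keep by position.
# Here: every row, every column except the last one.
by_position = df.iloc[:, :-1]

# pop removes the column from df in place and returns it as a Series.
popped = df.pop("amount")

print(dropped.columns.tolist())
print(by_position.columns.tolist())
print(df.columns.tolist())
```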