# pandas

Pandas is an open-source Python library for data manipulation and analysis. It provides data structures such as Series and DataFrame for cleaning, transforming, and analyzing large datasets, and it integrates with other Python libraries such as NumPy and Matplotlib.

The Pandas library is a tool for working with data, especially tabular data, much like a spreadsheet in Excel or a database table.

It helps you:

- Read and write data from files (like CSVs or Excel)
- Organize data in rows and columns (called a DataFrame)
- Clean, filter, sort, and analyze that data easily
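
As a rough illustration of the filter, sort, and analyze steps listed above, here is a minimal sketch. The `people` table and its column names are made up for this example and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration.
people = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chloe", "Dev"],
        "age": [34, 27, 45, 27],
        "city": ["Lima", "Kyoto", "Dublin", "Lima"],
    }
)

# Filter: keep only the rows where age is at least 30.
adults_30_plus = people[people["age"] >= 30]

# Sort: order by age, then by name to break ties.
by_age = people.sort_values(["age", "name"])

# Analyze: average age per city.
mean_age_by_city = people.groupby("city")["age"].mean()

print(adults_30_plus)
print(by_age)
print(mean_age_by_city)
```

Each step returns a new object and leaves `people` unchanged, which makes it easy to chain or inspect intermediate results.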

In [1]:
import pandas as pd
import numpy as np

In [None]:
# This reads a .csv file and prints the first few rows.
data = pd.read_csv(filepath_or_buffer="MOCK_DATA.csv")
print(data.head())

   customer_id first_name last_name                     email      gender  \
0            1     Raynor     Barck       rbarck0@foxnews.com        Male   
1            2    Auroora   Wadwell  awadwell1@purevolume.com      Female   
2            3     Carlie  Izkovici  cizkovici2@newyorker.com        Male   
3            4     Phylys   Mangeot        pmangeot3@fema.gov      Female   
4            5       Dede   Physick       dphysick4@chron.com  Polygender   

   phone_number   birthdate         city    country  
0  894-686-4011  12/15/1987  Los Nogales     Mexico  
1  724-684-7213   6/13/1987       Kanuma      Japan  
2  425-532-8352   6/14/1971    Penisihan  Indonesia  
3  572-963-2456    9/7/1979     Bundoran    Ireland  
4  112-398-9211   8/29/1975         Layo       Peru  
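
In the output above, `birthdate` looks like a date but is most likely loaded as plain text. The sketch below shows one way to convert it, assuming the month/day/year format seen in the printed rows. An in-memory buffer with two of those rows stands in for `MOCK_DATA.csv`, so the block runs without the file; the variable names `csv_text` and `sample` are illustrative.

```python
import io

import pandas as pd

# Two rows copied from the printed output; the buffer stands in for the real file.
csv_text = """customer_id,first_name,birthdate
1,Raynor,12/15/1987
2,Auroora,6/13/1987
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Convert the text column to datetimes, assuming month/day/year order.
sample["birthdate"] = pd.to_datetime(sample["birthdate"], format="%m/%d/%Y")

print(sample.dtypes)
print(sample["birthdate"].dt.year)
```

Once the column holds datetimes, the `.dt` accessor exposes parts such as year and month for filtering or grouping.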


# DataFrame
A DataFrame is a table of rows and columns, like a spreadsheet or a SQL table.
It is one of the main data structures in Pandas, and it lets you store and work with structured data easily.

In [None]:
# Create a DataFrame
# Three parameters are used here: data, index (row labels), and columns (column labels)
df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Column1', 'Column2', 'Column3', 'Column4'],
)
print(df.head())

# Save the DataFrame as a .csv file
#df.to_csv('test_data.csv')

# Select the row labeled 'Row1' (returned as a Series)
df.loc['Row1']

      Column1  Column2  Column3  Column4
Row1        0        1        2        3
Row2        4        5        6        7
Row3        8        9       10       11
Row4       12       13       14       15
Row5       16       17       18       19


Column1    0
Column2    1
Column3    2
Column4    3
Name: Row1, dtype: int64

A DataFrame in the pandas library for Python is a two-dimensional, labeled data structure with columns of potentially different types. It can be thought of as a table, similar to a spreadsheet in Excel or a table in a SQL database. 

Key characteristics of a pandas DataFrame:
- Two-dimensional: It organizes data into rows and columns.
- Labeled axes: Both rows (index) and columns have labels, which allows for easy access and manipulation of data.
- Heterogeneous data types: Columns can contain different data types (e.g., integers, strings, floats, booleans, dates).
- Size-mutable: DataFrames can be modified by adding or removing rows and columns.
- Collection of Series: Internally, a DataFrame can be viewed as a dictionary-like container of pandas Series objects, where each Series represents a column.
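
The characteristics above can be checked with a small sketch. The `df_mixed` table and its column names are hypothetical and chosen only to show mixed column types.

```python
import pandas as pd

# Hypothetical columns of different types (heterogeneous data) with labeled rows.
df_mixed = pd.DataFrame(
    {
        "name": ["Ana", "Ben"],
        "score": [9.5, 7.0],
        "passed": [True, False],
    },
    index=["r1", "r2"],
)
print(df_mixed.dtypes)  # one dtype per column

# Size-mutable: add a column, then drop another.
df_mixed["rank"] = [1, 2]
df_mixed = df_mixed.drop(columns=["passed"])

# Dictionary-like: iterating yields (column label, Series) pairs.
for label, column in df_mixed.items():
    print(label, type(column).__name__)

print(df_mixed.shape)
```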

Series - a single row or column: a one-dimensional, labeled sequence of values

DataFrame - two-dimensional: a collection of Series (columns) that share the same index
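
A quick way to see the difference between the two structures is to check what each kind of selection returns. This sketch rebuilds the same `df` used in this notebook so that it runs on its own; the names `row`, `col`, and `sub` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

row = df.loc["Row1"]   # one row label -> Series
col = df["Column1"]    # one column label -> Series
sub = df[["Column1"]]  # a list of labels -> DataFrame, even with a single column

print(type(row).__name__, row.ndim)
print(type(col).__name__, col.ndim)
print(type(sub).__name__, sub.ndim, sub.shape)
```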

In [None]:
# Accessing elements
# loc - label-based indexing: select a row by its index label
print(f"\n Row based Indexing : {df.loc['Row1']}") # Selects the first row

# [] - column selection by column label
print(f"\n\nColumn Based Indexing : {df['Column1']}")
print(f"\n\nTwo Column Based Indexing : {df[['Column1', 'Column2']]}")


# Series - a single row or column is returned as a one-dimensional Series
print(f"Type of Row 1 : {type(df.loc['Row1'])}")

# iloc - integer position-based indexing for both rows and columns
print(df.iloc[:,:]) # All rows and columns

print("\n\n", df.iloc[0:2,:]) # First 2 rows


 Row based Indexing : Column1    0
Column2    1
Column3    2
Column4    3
Name: Row1, dtype: int64


Column Based Indexing : Row1     0
Row2     4
Row3     8
Row4    12
Row5    16
Name: Column1, dtype: int64


Two Column Based Indexing :       Column1  Column2
Row1        0        1
Row2        4        5
Row3        8        9
Row4       12       13
Row5       16       17
Type of Row 1 : <class 'pandas.core.series.Series'>
      Column1  Column2  Column3  Column4
Row1        0        1        2        3
Row2        4        5        6        7
Row3        8        9       10       11
Row4       12       13       14       15
Row5       16       17       18       19


       Column1  Column2  Column3  Column4
Row1        0        1        2        3
Row2        4        5        6        7
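
One detail worth noting when mixing the two indexers: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position. The sketch below rebuilds the same `df` so that it runs on its own and selects the same 2x2 block both ways; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

# loc: both ends of a label slice are included.
by_label = df.loc["Row1":"Row2", "Column1":"Column2"]

# iloc: the end position is excluded, as with Python list slicing.
by_position = df.iloc[0:2, 0:2]

# A single cell can be read by labels or by positions.
cell_by_label = df.loc["Row2", "Column3"]
cell_by_position = df.iloc[1, 2]

print(by_label)
print(by_position)
print(cell_by_label, cell_by_position)
```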


In [28]:
# Convert a DataFrame into a NumPy array
print("\n", df.iloc[:,1:]) # All rows, second column onwards
print("\n Convert DataFrame to Array", df.iloc[:,1:].values)
print("\n Array Shape", df.iloc[:,1:].values.shape)


       Column2  Column3  Column4
Row1        1        2        3
Row2        5        6        7
Row3        9       10       11
Row4       13       14       15
Row5       17       18       19

 Convert DataFrame to Array [[ 1  2  3]
 [ 5  6  7]
 [ 9 10 11]
 [13 14 15]
 [17 18 19]]

 Array Shape (5, 3)
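
The pandas documentation recommends `DataFrame.to_numpy()` over the `.values` attribute for this conversion. A minimal sketch of the same step, rebuilding `df` so that the block runs on its own (`arr` is an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

# All rows, second column onwards, as a NumPy array.
arr = df.iloc[:, 1:].to_numpy()

print(type(arr).__name__)
print(arr.shape)
```

The row and column labels are dropped in the conversion; only the underlying values remain.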


In [None]:
# Get the count of occurrences of each value in Column1
print(f"Get Value Counts: {df['Column1'].value_counts()}") 

# Get the unique values from Column1
print(f"Get Unique Values: {df['Column1'].unique()}") 

Get Value Counts: Column1
0     1
4     1
8     1
12    1
16    1
Name: count, dtype: int64
Get Unique Values: [ 0  4  8 12 16]
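
Every value in `Column1` is distinct, so each count above is 1. The two methods are easier to read on a column with repeats. The sketch below uses a made-up `colors` Series, which is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical column with repeated values.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Counts per value, sorted from most to least frequent.
counts = colors.value_counts()

# Share of each value instead of raw counts.
shares = colors.value_counts(normalize=True)

# Distinct values in order of first appearance, and how many there are.
distinct = colors.unique()
n_distinct = colors.nunique()

print(counts)
print(shares)
print(distinct, n_distinct)
```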
