<a href="https://colab.research.google.com/github/DeepNets-US/Hands-on-EDA/blob/main/1.%20NumPy_%26_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NumPy**

Many of the operations below will seem obvious if you have already worked with NumPy. It is still worth going through this section, because these operations form the base of the statistical analysis and other operations used in EDA.

## **Arrays & Methods**

In [None]:
# Importing NumPy
import numpy as np

In [None]:
# 1D Array
array_1d = np.array(range(1,17))
array_1d

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])

In [None]:
# 2D Array
array_2d = array_1d.reshape(4,4)
array_2d

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12],
       [13, 14, 15, 16]])

In [None]:
# 3D Array
array_3d = array_1d.reshape(2,4,2)
array_3d

array([[[ 1,  2],
        [ 3,  4],
        [ 5,  6],
        [ 7,  8]],

       [[ 9, 10],
        [11, 12],
        [13, 14],
        [15, 16]]])

In [None]:
# Memory buffer (a memoryview over the array's data)
print(f"Memory Address: {array_1d.data}")

# Shape of the data
print(f"Shape: {array_1d.shape}")

# Data type
print(f"Data Type: {array_1d.dtype}")

# Strides of the array (bytes to step along each dimension)
print(f"Strides: {array_1d.strides}")

# Item size (bytes per element)
print(f"Item Size: {array_1d.itemsize}")


Memory Address: <memory at 0x7eb11835b280>
Shape: (16,)
Data Type: int64
Strides: (8,)
Item Size: 8


In [None]:
# Shape of the 2D array
print(f"Shape: {array_2d.shape}")

# Strides of the array (bytes to step along each dimension)
print(f"Strides: {array_2d.strides}")

# Item size (bytes per element)
print(f"Item Size: {array_2d.itemsize}")

Shape: (4, 4)
Strides: (32, 8)
Item Size: 8


Strides give the number of bytes to step in memory to reach the next element along each dimension of an array. The tuple has one entry per dimension, in the same order as `shape`.

* $32$: This is $\text{item size} \times \text{number of columns}$ ($8 \times 4$), the number of ***bytes*** to skip to move from one row to the next, because each row stores 4 elements of 8 bytes each.

* $8$: This is the number of bytes to move from one column to the next within the same row, which equals the item size.
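
The relationship between strides, item size, and shape is easier to see on a non-square array, where rows and columns have different lengths. The following is a minimal sketch; the array `arr` is a made-up example and is not part of the notebook's data.

```python
import numpy as np

# A non-square array makes it clear which dimension drives each stride.
arr = np.arange(12, dtype=np.int64).reshape(3, 4)   # 3 rows, 4 columns

row_stride, col_stride = arr.strides
print(arr.strides)        # (32, 8)

# Moving to the next row skips one full row of elements: itemsize * n_columns.
assert row_stride == arr.itemsize * arr.shape[1]

# Moving to the next column skips a single element.
assert col_stride == arr.itemsize

# Transposing swaps the strides without moving any data.
print(arr.T.strides)      # (8, 32)
```

With 3 rows and 4 columns, the row stride is still 32 bytes, which shows that it depends on the number of columns rather than the number of rows.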

## **Array from Built-In NumPy Functions**

In [None]:
# Array of ones
shape = (2, 2)
print(f"Ones: {np.ones(shape)}\n")

# Array of zeros
print(f"Zeros: {np.zeros(shape)}\n")

# Array of random values
print(f"Random Values: {np.random.random(shape)}\n")

# An uninitialized array: np.empty does not set the values, so the contents are arbitrary
print(f"Empty: {np.empty(shape)}\n")

# An array full of Ks (called a Full Array)
k = 5
print(f"Full: {np.full(shape, k)}\n")

# Evenly spaced values with a fixed step (stop is excluded)
start, stop, step = 5, 25, 3
print(f"Arrange: {np.arange(start, stop, step)}\n")

# Linspace
lower_bound, upper_bound, n = 10, 20, 20
print(f"Linspace: {np.linspace(lower_bound, upper_bound, n)}")

Ones: [[1. 1.]
 [1. 1.]]

Zeros: [[0. 0.]
 [0. 0.]]

Random Values: [[0.74688629 0.62653973]
 [0.38560424 0.67357972]]

Empty: [[0.74688629 0.62653973]
 [0.38560424 0.67357972]]

Full: [[5 5]
 [5 5]]

Arrange: [ 5  8 11 14 17 20 23]

Linspace: [10.         10.52631579 11.05263158 11.57894737 12.10526316 12.63157895
 13.15789474 13.68421053 14.21052632 14.73684211 15.26315789 15.78947368
 16.31578947 16.84210526 17.36842105 17.89473684 18.42105263 18.94736842
 19.47368421 20.        ]
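
The difference between `np.arange` and `np.linspace` in the output above comes down to what you specify: a step size or a number of points. A short sketch of that difference, using the same bounds as the cell above:

```python
import numpy as np

# arange takes a step and excludes the stop value.
stepped = np.arange(5, 25, 3)
print(stepped)            # last value is 23, not 25

# linspace takes a number of points and includes both endpoints by default.
points, spacing = np.linspace(10, 20, 20, retstep=True)
print(len(points), points[0], points[-1])

# The spacing is (stop - start) / (num - 1), which is why the values look uneven.
assert np.isclose(spacing, (20 - 10) / (20 - 1))

# endpoint=False leaves out the stop value, giving a spacing of 0.5 here.
half_open = np.linspace(10, 20, 20, endpoint=False)
assert np.isclose(half_open[1] - half_open[0], 0.5)
```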


## **I/O Operations**

In [None]:
# Saving data to a txt file
data = np.arange(10, 50, 6)

# Saving data
np.savetxt("data.txt", data, delimiter=",")

print(f"Data: {data}")

Data: [10 16 22 28 34 40 46]


In [None]:
# Loading data from the saved file
np.loadtxt("data.txt", delimiter=",", unpack=True)

array([10., 16., 22., 28., 34., 40., 46.])

In [None]:
# Using "genfromtxt": skip_header=1 skips the first line (the value 10); filling_values replaces missing entries
np.genfromtxt("data.txt", skip_header=1, filling_values=-999)

array([16., 22., 28., 34., 40., 46.])
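
The `genfromtxt` call above returns six values instead of seven because `skip_header=1` drops the first line of a file that has no header. The sketch below contrasts that with a file that does carry a header line. The temporary directory and the file names are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

data = np.arange(10, 50, 6)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names, used only for this example.
    header_path = os.path.join(tmp_dir, "data_with_header.txt")
    plain_path = os.path.join(tmp_dir, "data.txt")

    # header= writes a comment line ("# value") above the numbers.
    np.savetxt(header_path, data, delimiter=",", header="value", fmt="%d")
    np.savetxt(plain_path, data, delimiter=",")

    # Both loaders ignore lines starting with "#", so all 7 values come back.
    loaded = np.loadtxt(header_path, delimiter=",")
    generated = np.genfromtxt(header_path, delimiter=",")

    # On a file without a header, skip_header=1 discards the first data value.
    skipped = np.genfromtxt(plain_path, skip_header=1)

print(loaded.shape)       # (7,)
print(skipped.shape)      # (6,)
```

Both loaders return floating-point arrays by default, even though the saved values were integers.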

## **Inspecting Arrays**

In [None]:
print(f"N Dims     : {array_3d.ndim}")
print(f"Shape      : {array_3d.shape}")
print(f"Size       : {array_3d.size}")
print(f"Item Size  : {array_3d.itemsize}")
print(f"Bytes Consumed : {array_3d.nbytes}")
print(f"Memory layout  : \n{array_3d.flags}")

N Dims     : 3
Shape      : (2, 4, 2)
Size       : 16
Item Size  : 8
Bytes Consumed : 128
Memory layout  : 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False



1. **C_CONTIGUOUS**: This attribute indicates whether the array data is in C-style contiguous order. In C-style order, the last axis changes the fastest (row-major order). If `C_CONTIGUOUS` is True, it means the array is stored in a way that is efficient for C-style indexing and operations.

2. **F_CONTIGUOUS**: This attribute indicates whether the array data is in Fortran-style contiguous order. In Fortran-style order, the first axis changes the fastest (column-major order). If `F_CONTIGUOUS` is True, it means the array is stored in a way that is efficient for Fortran-style indexing and operations.

3. **OWNDATA**: This attribute is True if the array owns the memory it uses, meaning it is not a view of another array's data. A copy owns its own data. If `OWNDATA` is False, the array borrows its data from another object. Here it is False because `array_3d` is a reshaped view of `array_1d`.

4. **WRITEABLE**: This attribute indicates whether the array data can be modified. If `WRITEABLE` is True, you can modify the data in the array. If False, it means the array is read-only.

5. **ALIGNED**: This attribute is True if the data and all of its elements are aligned appropriately for the hardware. Alignment can affect the performance of certain operations on the array.

6. **WRITEBACKIFCOPY**: This attribute is True if the array is a temporary copy of another array whose contents must be written back to that original. It mainly matters when working with the C API; in ordinary Python code it is usually False.
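
A few of these flags can be observed directly by comparing an array, a view of it, and a copy. This is a small sketch with made-up variable names (`base`, `view`, `copied`, `matrix`, `locked`), not a complete tour of the flags.

```python
import numpy as np

base = np.arange(1, 17)
view = base.reshape(2, 4, 2)     # reshaping a contiguous array returns a view
copied = view.copy()             # a copy gets its own memory

print(base.flags["OWNDATA"])     # True
print(view.flags["OWNDATA"])     # False
print(copied.flags["OWNDATA"])   # True
print(view.base is base)         # True

# Transposing a C-contiguous 2D array gives an F-contiguous view.
matrix = np.arange(6).reshape(2, 3)
print(matrix.T.flags["F_CONTIGUOUS"])   # True
print(matrix.T.flags["C_CONTIGUOUS"])   # False

# Clearing WRITEABLE makes the array read-only.
locked = base.copy()
locked.flags.writeable = False
try:
    locked[0] = 99
except ValueError as err:
    print(f"Write rejected: {err}")
```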

## **Broadcasting**

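Broadcasting is the NumPy mechanism that allows element-wise operations between arrays of different shapes. When the shapes are compatible, the smaller array is treated as if it were repeated to match the larger one, without actually copying its data. Note that `@` below is matrix multiplication rather than an element-wise operation.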

In [None]:
a = np.random.random((2,2))
b = np.random.random(2)

print(
    f"A: {a}\n\n",
    f"B: {b}"
)

A: [[0.6722502  0.20366863]
 [0.71815131 0.83092016]]

 B: [0.03538458 0.86494721]


In [None]:
a + b

array([[0.70763478, 1.06861584],
       [0.75353588, 1.69586737]])

In [None]:
a - b

array([[ 0.63686563, -0.66127859],
       [ 0.68276673, -0.03402705]])

In [None]:
a * b

array([[0.02378729, 0.17616261],
       [0.02541148, 0.71870207]])

In [None]:
a @ b

array([0.1999499 , 0.74411355])

In [None]:
a / b

array([[18.99839638,  0.23546943],
       [20.29560303,  0.96065996]])

In [None]:
a // b

array([[18.,  0.],
       [20.,  0.]])

In [None]:
a % b

array([[0.03532783, 0.20366863],
       [0.01045979, 0.83092016]])
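
The operations in this section rely on the shapes `(2, 2)` and `(2,)` being compatible. The sketch below uses small fixed arrays instead of random ones so the results can be checked by hand; the values are illustrative only.

```python
import numpy as np

a = np.arange(4.0).reshape(2, 2)    # shape (2, 2): rows [0, 1] and [2, 3]
b = np.array([10.0, 20.0])          # shape (2,)

# Shapes are compared from the trailing dimension: (2, 2) and (2,) give (2, 2).
print(np.broadcast_shapes(a.shape, b.shape))

# Conceptually, b is stretched to rows [10, 20] and [10, 20]; no data is copied.
stretched = np.broadcast_to(b, a.shape)
assert np.array_equal(a + b, a + stretched)
print(a + b)        # rows [10, 21] and [12, 23]

# A column vector broadcasts along the other axis instead.
col = b.reshape(2, 1)
print(a + col)      # rows [10, 11] and [22, 23]

# Incompatible trailing dimensions raise an error.
try:
    a + np.ones(3)
except ValueError as err:
    print(f"Cannot broadcast: {err}")

# a @ b is a matrix-vector product: each row of a is multiplied by b and summed.
assert np.array_equal(a @ b, (a * b).sum(axis=1))
```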

## **NumPy Slicing & Subsets**

In [None]:
array_3d

array([[[ 1,  2],
        [ 3,  4],
        [ 5,  6],
        [ 7,  8]],

       [[ 9, 10],
        [11, 12],
        [13, 14],
        [15, 16]]])

In [None]:
array_3d[1]

array([[ 9, 10],
       [11, 12],
       [13, 14],
       [15, 16]])

In [None]:
array_3d[0, :, 0]

array([1, 3, 5, 7])

In [None]:
array_3d[array_3d >= 5]

array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])
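
One detail worth knowing about the selections above is that basic slicing returns a view of the original data, while boolean masking returns a copy. A minimal sketch, using a fresh array `arr` with the same values as `array_3d` so the original is left untouched:

```python
import numpy as np

arr = np.arange(1, 17).reshape(2, 4, 2)

# Basic slicing returns a view: writing through it changes the original.
first_column = arr[0, :, 0]
first_column[0] = 100
print(arr[0, 0, 0])          # 100

# Boolean masking returns a copy: changing it leaves the original alone.
large = arr[arr >= 5]
large[:] = 0
print(arr.max())             # 100

# Assigning through a mask on the left-hand side does modify the array in place.
arr[arr >= 100] = 1
print(arr[0, 0, 0])          # 1
```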

# **Pandas**

🕰️ **History:**
Introduced in 2008 by Wes McKinney, Pandas has evolved into a powerhouse for data manipulation and analysis in Python. Its inception aimed to bridge the gap between powerful data analysis tools and Python's simplicity, making data tasks more accessible.

🚀 **Present Impact:**
Today, Pandas is an integral part of the Python data science ecosystem. With its intuitive data structures, DataFrame and Series, Pandas simplifies tasks like cleaning, aggregating, and transforming data. Widely adopted in academia, industry, and open-source projects, it continues to shape the landscape of data handling.

🛠️ **Use Cases:**
Pandas finds its application across diverse domains:
- **Data Cleaning:** Effortlessly handle missing data, duplicates, and outliers.
- **Exploratory Data Analysis (EDA):** Uncover insights with powerful data manipulation and visualization.
- **Data Transformation:** Easily reshape and pivot datasets for analysis.
- **Time Series Analysis:** Seamless handling of time-stamped data for financial, IoT, or scientific applications.
- **Data Integration:** Merge, join, and concatenate datasets with ease.

🌐 **Why Pandas?**
- **User-Friendly:** Intuitive syntax for quick adoption.
- **Flexible Data Structures:** DataFrame and Series cater to diverse data types and structures.
- **Extensive Documentation:** Comprehensive resources for users at all skill levels.
- **Active Community:** Vibrant community support ensures continuous improvement and troubleshooting.

Whether you're a data scientist, analyst, or enthusiast, Pandas is your ally in taming the complexities of data. Dive into its versatility and empower your Python data workflows! 🐼💻

## **Importing Pandas**

In [None]:
# Installing Python Library
!pip install -q pandas

# Importing Module
import numpy as np
import pandas as pd

print(f"Pandas Version: {pd.__version__}")
print(f"NumPy  Version: {np.__version__}")

Pandas Version: 1.5.3
NumPy  Version: 1.25.2


## **Pandas DataFrame**

Pandas introduces two primary data structures: **DataFrame** and **Series.**

> A **Series** serves as a named, one-dimensional array, akin to a list but augmented with dedicated operations. It provides a foundational element for data manipulation, offering labeled data points and streamlined operations.

> A **DataFrame**, by contrast, can be thought of as a collection of Series that share a common index. It is a two-dimensional, tabular data structure with labeled axes (rows and columns) that supports manipulation and analysis of structured data. Together, Series and DataFrame form the backbone of Pandas.

In [None]:
# Pandas Series
series = pd.Series(range(1,11), name="Numbers", index=range(5,15))
series

5      1
6      2
7      3
8      4
9      5
10     6
11     7
12     8
13     9
14    10
Name: Numbers, dtype: int64

In [None]:
# Data Type
type(series)

### **Creating DataFrame**

There are many ways to create a **DataFrame**; one of the most convenient is to build it from a **dictionary** that maps column names to column values.

In [None]:
# Pandas Dataframe
series_df = pd.DataFrame({
    "Name":["Alex", "Alice", "Sam", "Tom", "Cat"],
    "Age":[15,78,23,45,98],
    "Sex": ["Female", "Female", "Male", "Male", "Female"],
    "Disorder": pd.Categorical(["Depression", "Social Anxiety", "Bipolar Disorder", "Eating Disorder", "Bipolar Disorder"])
})

# Quick look
series_df

|   | Name | Age | Sex | Disorder |
|---|---|---|---|---|
| 0 | Alex | 15 | Female | Depression |
| 1 | Alice | 78 | Female | Social Anxiety |
| 2 | Sam | 23 | Male | Bipolar Disorder |
| 3 | Tom | 45 | Male | Eating Disorder |
| 4 | Cat | 98 | Female | Bipolar Disorder |
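
A dictionary of columns is not the only option. The sketch below builds the same small frame from a list of row dictionaries and shows that each column of a DataFrame is itself a Series. It reuses three names and ages from the example above; the variable names are made up for illustration.

```python
import pandas as pd

# One dictionary per row, instead of one list per column.
records = [
    {"Name": "Alex", "Age": 15},
    {"Name": "Alice", "Age": 78},
    {"Name": "Sam", "Age": 23},
]
from_records = pd.DataFrame(records)

# The equivalent column-oriented construction.
from_columns = pd.DataFrame({"Name": ["Alex", "Alice", "Sam"], "Age": [15, 78, 23]})
assert from_records.equals(from_columns)

# Selecting one column returns a Series that shares the frame's row index.
ages = from_columns["Age"]
print(type(ages).__name__)     # Series
print(ages.name)               # Age
assert ages.index.equals(from_columns.index)
```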


### **Loading Data from External Source**

In [None]:
# Data Configurations
columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'ethnicity',
    'gender','capital_gain','capital_loss','hours_per_week',
    'country_of_origin','income'
]
data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

# Loading an external data
df = pd.read_csv(data_url, names=columns)
df.head()

|   | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | ethnicity | gender | capital_gain | capital_loss | hours_per_week | country_of_origin | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
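
Because the file has no header row, `names=` supplies the column labels. Two further `read_csv` options can be useful for files like this one, on the assumption (worth verifying against the raw file) that values are separated by a comma followed by a space and that unknown entries are written as `?`. The sketch below uses a small in-memory sample rather than the real download; the third row is invented to illustrate a missing value.

```python
import io

import pandas as pd

# Hypothetical header-less excerpt; the "?" in the third row is made up.
raw = io.StringIO(
    "39, State-gov, Bachelors, <=50K\n"
    "50, Self-emp-not-inc, Bachelors, <=50K\n"
    "38, ?, HS-grad, <=50K\n"
)
sample_columns = ["age", "workclass", "education", "income"]

sample = pd.read_csv(
    raw,
    names=sample_columns,      # the data has no header row
    skipinitialspace=True,     # drop the space that follows each comma
    na_values="?",             # treat "?" as a missing value
)

print(sample)
print(sample["workclass"].isnull().sum())   # 1
```

Without `skipinitialspace=True`, string values would keep a leading space, so a comparison such as `workclass == "State-gov"` would not match.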


### **DataFrame Operations**

In [None]:
# Displaying information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   age                32561 non-null  int64 
 1   workclass          32561 non-null  object
 2   fnlwgt             32561 non-null  int64 
 3   education          32561 non-null  object
 4   education_num      32561 non-null  int64 
 5   marital_status     32561 non-null  object
 6   occupation         32561 non-null  object
 7   relationship       32561 non-null  object
 8   ethnicity          32561 non-null  object
 9   gender             32561 non-null  object
 10  capital_gain       32561 non-null  int64 
 11  capital_loss       32561 non-null  int64 
 12  hours_per_week     32561 non-null  int64 
 13  country_of_origin  32561 non-null  object
 14  income             32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [None]:
# Selecting rows and columns
df.iloc[:10]                        # First 10 rows

|   | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | ethnicity | gender | capital_gain | capital_loss | hours_per_week | country_of_origin | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |


In [None]:
df.iloc[-10:]                        # Last 10 rows

|   | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | ethnicity | gender | capital_gain | capital_loss | hours_per_week | country_of_origin | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32551 | 32 | Private | 34066 | 10th | 6 | Married-civ-spouse | Handlers-cleaners | Husband | Amer-Indian-Eskimo | Male | 0 | 0 | 40 | United-States | <=50K |
| 32552 | 43 | Private | 84661 | Assoc-voc | 11 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 45 | United-States | <=50K |
| 32553 | 32 | Private | 116138 | Masters | 14 | Never-married | Tech-support | Not-in-family | Asian-Pac-Islander | Male | 0 | 0 | 11 | Taiwan | <=50K |
| 32554 | 53 | Private | 321865 | Masters | 14 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32555 | 22 | Private | 310152 | Some-college | 10 | Never-married | Protective-serv | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
| 32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |


In [None]:
df.iloc[:5, [0,2]]                        # First 5 rows; columns 0 and 2 (age & fnlwgt)

|   | age | fnlwgt |
|---|---|---|
| 0 | 39 | 77516 |
| 1 | 50 | 83311 |
| 2 | 38 | 215646 |
| 3 | 53 | 234721 |
| 4 | 28 | 338409 |
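
`iloc` selects by integer position. Its label-based counterpart is `loc`, and the two differ in how they treat the end of a slice. A minimal sketch on a small made-up frame (`people`) with non-default index labels chosen for illustration:

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Alex", "Alice", "Sam", "Tom", "Cat"], "Age": [15, 78, 23, 45, 98]},
    index=[10, 11, 12, 13, 14],    # hypothetical labels, to separate label from position
)

# iloc uses integer positions; the end of the slice is excluded.
by_position = people.iloc[:2, [0, 1]]

# loc uses labels; the end of the slice is included.
by_label = people.loc[10:11, ["Name", "Age"]]
assert by_position.equals(by_label)

# loc also accepts a boolean condition for the rows.
adults = people.loc[people["Age"] >= 18, "Name"]
print(adults.tolist())     # ['Alice', 'Sam', 'Tom', 'Cat']
```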


In [None]:
# Concatenation of DataFrames
a = pd.DataFrame({'A': range(1,11)})
b = pd.DataFrame({'B': range(11,21)})

c = pd.concat([a, b], axis=1)

# axis = 0 : Rows
# axis = 1 : Columns
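
The effect of `axis` is easiest to see from the shape of the result. The sketch below repeats the two frames from the cell above and compares both directions; the result names are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({"A": range(1, 11)})
b = pd.DataFrame({"B": range(11, 21)})

# axis=1 places the frames side by side, aligning rows on the index.
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side.shape)                  # (10, 2)

# axis=0 stacks the frames; columns missing from one frame are filled with NaN.
stacked = pd.concat([a, b], axis=0)
print(stacked.shape)                       # (20, 2)
print(stacked.isnull().sum().tolist())     # [10, 10]

# ignore_index=True renumbers the rows instead of repeating 0..9 twice.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index[-1])                # 19
```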


In [None]:
# Statistical information
df.describe()

|   | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [None]:
# Checking for null values
df.isnull().sum()

age                  0
workclass            0
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation           0
relationship         0
ethnicity            0
gender               0
capital_gain         0
capital_loss         0
hours_per_week       0
country_of_origin    0
income               0
dtype: int64

In [None]:
# Checking for Duplicate Values
df.duplicated().sum()

24

In [None]:
# View the Duplicated values
df[df.duplicated()].head()

|   | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | ethnicity | gender | capital_gain | capital_loss | hours_per_week | country_of_origin | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4881 | 25 | Private | 308144 | Bachelors | 13 | Never-married | Craft-repair | Not-in-family | White | Male | 0 | 0 | 40 | Mexico | <=50K |
| 5104 | 90 | Private | 52386 | Some-college | 10 | Never-married | Other-service | Not-in-family | Asian-Pac-Islander | Male | 0 | 0 | 35 | United-States | <=50K |
| 9171 | 21 | Private | 250051 | Some-college | 10 | Never-married | Prof-specialty | Own-child | White | Female | 0 | 0 | 10 | United-States | <=50K |
| 11631 | 20 | Private | 107658 | Some-college | 10 | Never-married | Tech-support | Not-in-family | White | Female | 0 | 0 | 10 | United-States | <=50K |
| 13084 | 25 | Private | 195994 | 1st-4th | 2 | Never-married | Priv-house-serv | Not-in-family | White | Female | 0 | 0 | 40 | Guatemala | <=50K |
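
By default, `duplicated()` marks only the second and later occurrences of a repeated row, which is why the count above covers the repeats and not the first copies. A minimal sketch on a small made-up frame shows that behaviour and one way to remove the repeats:

```python
import pandas as pd

# Hypothetical frame in which the row ("Alex", 15) appears three times.
frame = pd.DataFrame({
    "Name": ["Alex", "Alice", "Alex", "Sam", "Alex"],
    "Age": [15, 78, 15, 23, 15],
})

# Default keep="first": the first occurrence is not flagged.
print(frame.duplicated().sum())              # 2

# keep=False flags every row that has a duplicate, including the first one.
print(frame.duplicated(keep=False).sum())    # 3

# drop_duplicates keeps the first occurrence; reset_index tidies the row labels.
deduplicated = frame.drop_duplicates().reset_index(drop=True)
print(len(deduplicated))                     # 3
```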


# **SciPy & Matplotlib**

**SciPy** and **Matplotlib** are two other standard libraries in the scientific Python and machine learning ecosystem. Both are large, so covering them properly is beyond the scope of this section.

- **SciPy** provides a broad collection of numerical algorithms, including optimization, statistics, clustering, and signal processing tools for more advanced tasks.

- Meanwhile, **Matplotlib** excels in data visualization, offering scatter plots, histograms, dot plots, bar plots, box plots, and more.

For those seeking a deeper understanding of [SciPy](https://docs.scipy.org/doc/scipy/tutorial/index.html#user-guide) and [Matplotlib](https://matplotlib.org/stable/users/index.html), it is highly recommended to explore the official documentation pages to unlock the full potential of these libraries.