## Exploratory Data Analysis

This notebook is a place for me to revise the EDA modules and topics I have learned over the past months, and a testing space where I can practice new EDA concepts and modules. It will mainly cover the following:
* Numpy
* Pandas
* Matplotlib
* Seaborn

## NumPy

The NumPy module is quickly reviewed here.

In [1]:
import numpy as np

### Creating Arrays

In [2]:
# Create an array with two dimensions:

arr = np.array([1,2,3,4,5], ndmin = 2)       

print(arr)

# Check the number of dimensions:

arr.ndim

[[1 2 3 4 5]]


2

In [3]:
# Creating arrays with various dimensions:

arr1 = np.array(50)

arr2 = np.array([50])

arr3 = np.array([[1,2,3],[4,5,6]])

arr4 = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])

print(arr1.ndim)
print(arr2.ndim)
print(arr3.ndim)
print(arr4.ndim)

0
1
2
3


### Array Indexing

In [4]:
# Defining the array:

arr = np.array([1, 2, 3, 4, 5])

# Print the third item in the array:

print(arr[2])

3


In [5]:
arr = np.array([[10, 20, 30, 40], [50, 60, 70, 80]])

# Print the number 50 from the array:

print(arr[1,0])

50


In [6]:
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

# Access the third element of the second array of the first array:

arr[0,1,2]

6

### Array Slicing

In [7]:
arr = np.array([1, 2, 3, 4, 5, 6, 7])

# Task One: Slice from the index 3 from the end to index 1 from the end:

print('First Task Solution: ', arr[-3:-1])

# Task Two: Return every other element from index 1 to index 5 (not included):

print('Second Task Solution: ', arr[1:5:2])

# Task Three: Return every other element from the entire array:

print('Third Task Solution: ', arr[::2])

First Task Solution:  [5 6]
Second Task Solution:  [2 4]
Third Task Solution:  [1 3 5 7]


In [8]:
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

# Task One: From the second element, slice elements from index 1 to index 4 (not included).

print('First Task Solution: ', arr[1,1:4])

# Task Two: From both elements, return index 2.

print('Second Task Solution: ', arr[:,2])

# Task Three: From both elements, slice index 1 to index 4 (not included).

print('Third Task Solution: ', arr[:,1:4])

First Task Solution:  [7 8 9]
Second Task Solution:  [3 8]
Third Task Solution:  [[2 3 4]
 [7 8 9]]


### Data Types

In [9]:
# Task One: Check the Data Type of the following Array:

arr = np.array([1, 2, 3, 4])

print('First Task Solution: ', arr.dtype, '\n')

# Task Two: Create an Array With a Defined Data Type:

arr = np.array([1,2,3,4], dtype = 'f')

print('Second Task Solution: ', 'Array:', arr, 'Data Type:', arr.dtype, '\n')

# Note: For i, u, f, S and U we can define size as well.

# Task Three: Create an Array With a 4-byte Integer Data Type:

arr = np.array([1,2,3,4], dtype = 'i4')

print('Third Task Solution: ', 'Array:', arr, 'Data Type:', arr.dtype, '\n')

# Task Four: Change the Data Type of an Array:

arr = np.array([1,0,-1])

boolean = arr.astype(bool)

floating = arr.astype('f')

print('Fourth Task Solution: ', 'Array With Boolean Data Type:', boolean, 'Array With Float Data Type:', floating)

First Task Solution:  int64 

Second Task Solution:  Array: [1. 2. 3. 4.] Data Type: float32 

Third Task Solution:  Array: [1 2 3 4] Data Type: int32 

Fourth Task Solution:  Array With Boolean Data Type: [ True False  True] Array With Float Data Type: [ 1.  0. -1.]
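
The note in the cell above about sized type codes can be sketched as follows. This is a minimal illustration, not part of the original exercises; the variable names are made up for the example.

```python
import numpy as np

# A size suffix after the type code gives the width in bytes
# (for 'U', the maximum number of characters).
small_uint = np.array([1, 2, 3], dtype='u1')    # 1-byte unsigned integer
double = np.array([1, 2, 3], dtype='f8')        # 8-byte float
text = np.array(['ab', 'cd'], dtype='U5')       # unicode strings, up to 5 characters
raw = np.array([b'ab', b'cd'], dtype='S3')      # byte strings, up to 3 bytes

print(small_uint.dtype, double.dtype, text.dtype, raw.dtype)
print(small_uint.itemsize, double.itemsize, raw.itemsize)
```

The itemsize attribute reports the number of bytes each element occupies, which is a quick way to confirm that the requested size was applied.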


### Array Copy vs View
- A copy owns its data, so changes made to the copy do not affect the original, and vice versa.
- A view does not own its data, so changes made to the view affect the original, and vice versa.

In [10]:
# Let's try the copy method:

arr = np.array([1,2,3,4])

copy = np.copy(arr)

print('Original: ', arr, 'Copy: ', copy)

copy[1] = 50
arr[2] = 100

print('Original: ', arr, 'Copy: ', copy)

Original:  [1 2 3 4] Copy:  [1 2 3 4]
Original:  [  1   2 100   4] Copy:  [ 1 50  3  4]


In [11]:
# Let's try the view method:

arr = np.array([1,2,3,4])

view = arr.view()

print('Original: ', arr, 'view: ', view)

view[1] = 50
arr[2] = 100

print('Original: ', arr, 'view: ', view)

Original:  [1 2 3 4] view:  [1 2 3 4]
Original:  [  1  50 100   4] view:  [  1  50 100   4]


In [12]:
arr = np.array([1,2,3,4])

x = arr.copy()
y = arr.view()

print(x.base)
print(y.base)

None
[1 2 3 4]


#### Conclusion: The cells above confirm the earlier claims. A copy owns its data (its base is None) while a view does not (its base is the original array), and changing a copy or the original leaves the other untouched, whereas changing a view or the original changes both.
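
Another way to check ownership, sketched here as an addition to the cells above, is np.shares_memory, which reports whether two arrays overlap in memory. The example also assumes it is useful to note that basic slicing returns a view; the names copied, viewed and sliced are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

copied = arr.copy()     # independent buffer
viewed = arr.view()     # same buffer as arr
sliced = arr[1:3]       # basic slicing also returns a view

print(np.shares_memory(arr, copied))
print(np.shares_memory(arr, viewed))
print(np.shares_memory(arr, sliced))

sliced[0] = 99          # writes through to arr[1]
print(arr)
print(copied)           # the copy keeps the old values
```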

### Array Shape
#### The shape of an array is the number of elements in each dimension.

In [13]:
arr = np.array([[1, 2, 3, 4], [5, 6, 7,8]])

# The first dimension has two elements and the second has four, so (2, 4) is expected:

arr.shape

(2, 4)

In [14]:
# Create an array with 5 dimensions using a vector with values 1,2,3,4 and verify that last dimension has value 4:
arr = np.array([1,2,3,4], ndmin = 5)
print('Array:', arr,', ' 'Shape:', arr.shape)

Array: [[[[[1 2 3 4]]]]] , Shape: (1, 1, 1, 1, 4)
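
As a small sketch of how shape relates to the other array attributes (an addition for illustration, reusing the 2x4 array from the first shape cell): the length of the shape tuple is the number of dimensions, and the product of its entries is the total number of elements.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)        # elements per dimension
print(len(arr.shape))   # equals arr.ndim
print(arr.size)         # total element count: product of the shape entries
```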


### Array Reshaping
* Reshaping means changing the shape of an array.
* By reshaping we can add or remove dimensions, or change the number of elements in each dimension.
* The number of elements must be preserved.
#### Examples:

In [15]:
# 1) 1D-2D:

before = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
after = before.reshape(3,-1)                    # -1 lets NumPy infer the size of that dimension

print('The Original Shape: ', before.shape, ',', 'The Resulting Shape: ', after.shape, '\n')

# 2) 1D-3D:

after2 = before.reshape(3,2,-1)                  

print('The Original Shape: ', before.shape, ',', 'The Resulting Shape: ', after2.shape,'\n')

# 3) Flattening an array (reducing it to one dimension):

after3 = after2.reshape(-1)

print('The Original Shape: ', after2.shape, ',','The Resulting Array, Shape: ', after3,', ',after3.shape)

The Original Shape:  (12,) , The Resulting Shape:  (3, 4) 

The Original Shape:  (12,) , The Resulting Shape:  (3, 2, 2) 

The Original Shape:  (3, 2, 2) , The Resulting Array, Shape:  [ 1  2  3  4  5  6  7  8  9 10 11 12] ,  (12,)
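
The rule that the number of elements must be preserved can be sketched with a failing case. This is an illustrative addition: the flag reshaped_ok and the target shapes are chosen for the example.

```python
import numpy as np

before = np.arange(1, 13)          # 12 elements, the same values as in the cell above

try:
    before.reshape(5, -1)          # 12 is not divisible by 5, so no size fits
    reshaped_ok = True
except ValueError as err:
    reshaped_ok = False
    print('Reshape failed:', err)

# Any shape whose dimensions multiply to 12 works.
ok = before.reshape(2, 6)
print(ok.shape, ok.size == before.size)
```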


### Array Iterating

#### Iterating over an array works exactly like iterating over a regular list, but with many dimensions the nested loops become tedious. Hence, dedicated helper functions are used. They are shown below:

In [16]:
# 1) Regular methods:

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

for x in arr:
    for y in x:
        for z in y:
            pass              
            #print(z)                        # Uncomment the method that you want to see.

# 2) Using nditer():

for x in np.nditer(arr):
    #print(x)
    pass

# 3) Using ndenumerate():

for idx, x in np.ndenumerate(arr[:,:,::2]):  # You can slice first; idx then refers to positions in the sliced array.
    print(idx,x)

(0, 0, 0) 1
(0, 0, 1) 3
(0, 1, 0) 4
(0, 1, 1) 6
(1, 0, 0) 7
(1, 0, 1) 9
(1, 1, 0) 10
(1, 1, 1) 12
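
One detail of the output above is worth a short sketch: the indices printed by ndenumerate address the sliced array, not the original one. The mapping back to the original index below is written by hand for this particular step-2 slice and is only an illustration.

```python
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
sliced = arr[:, :, ::2]            # keeps positions 0 and 2 along the last axis

for idx, value in np.ndenumerate(sliced):
    # idx addresses `sliced`; the last axis was taken with step 2,
    # so the matching position in `arr` has that component doubled.
    original_idx = (idx[0], idx[1], idx[2] * 2)
    assert arr[original_idx] == value

print(sliced.shape)
```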


## End of NumPy Foundations

#### This is all the foundation I need for now; I will practice the remaining topics at the link below once I see the need for them.
https://www.w3schools.com/python/numpy/numpy_array_join.asp

## Pandas
#### I studied pandas to a good extent in the past, so I will go through the foundations very quickly and then jump straight into exercises and mini projects.

In [17]:
import pandas as pd

In [18]:
pd.__version__

'1.4.2'

### Series

In [19]:
# Creating a series from a list:

names = ['Hashem', 'Motaz', 'Eissa', 'Omar']
idx = ['Leader', 'Great', 'Funny', 'Blood']

a = pd.Series(names,index=idx)
a

Leader    Hashem
Great      Motaz
Funny      Eissa
Blood       Omar
dtype: object

In [20]:
# Accessing the values in the series:

print(a['Leader'] == a.iloc[0])

a.iloc[0]    # a[0] also works in pandas 1.4.2, but positional a[0] on a labelled Series is deprecated in newer versions

True


'Hashem'
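
A minimal sketch of the two explicit accessors, added for illustration: loc looks values up by label and iloc by position. The series is rebuilt here so the block runs on its own, and the variable names are made up for the example.

```python
import pandas as pd

a = pd.Series(['Hashem', 'Motaz', 'Eissa', 'Omar'],
              index=['Leader', 'Great', 'Funny', 'Blood'])

by_label = a.loc['Funny']              # label-based lookup
by_position = a.iloc[2]                # position-based lookup
label_slice = a.loc['Great':'Funny']   # label slices include the end label
position_slice = a.iloc[1:2]           # position slices exclude the end position

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Being explicit about loc or iloc avoids any ambiguity about whether a key is meant as a label or as a position.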

In [21]:
# Creating a series from a dictionary:

food = {'fruit': 'Apple', 'Vegetable': 'Cucumber', 'Meat': 'Steak'}

pd.Series(food, index = ['fruit', 'Meat'])         # You don't have to use the entire dictionary

fruit    Apple
Meat     Steak
dtype: object
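
The index argument selects by key, so a label that is not in the dictionary does not raise an error; it yields a missing value. A minimal sketch, where 'Dairy' is a hypothetical label added only for the example:

```python
import pandas as pd

food = {'fruit': 'Apple', 'Vegetable': 'Cucumber'}

# 'Dairy' is not a key of the dictionary, so its value becomes NaN.
s = pd.Series(food, index=['fruit', 'Dairy'])

print(s)
print(s.isna())
```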

### DataFrames

In [22]:
# Creating a DataFrame from a dictionary:

food = {'Fruits': ['Apple','Orange','Bananas'], 'Vegetables': ['Cucumber','Potatoes','Corn'],
        'Meats': ['Steak', 'Chicken', 'Fish']}

df = pd.DataFrame(food, index=['Good', 'Neutral', 'Bad'])
df

          Fruits Vegetables    Meats
Good       Apple   Cucumber    Steak
Neutral   Orange   Potatoes  Chicken
Bad      Bananas       Corn     Fish


In [23]:
# Accessing the DataFrame:

print(df.loc['Good','Fruits'] == df.iloc[0,0])

print(df.iloc[0,0])

df.iloc[:,0:2]

True
Apple


          Fruits Vegetables
Good       Apple   Cucumber
Neutral   Orange   Potatoes
Bad      Bananas       Corn


In [24]:
# Reading from a CSV file:

df = pd.read_csv('data.csv')

df

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4


In [25]:
# Displaying the entire DataFrame (changing the maximum number of displayed rows):

# pd.options.display.max_rows = 10000    # uncomment to apply what you want 

# df
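
Setting pd.options.display.max_rows changes the limit for the rest of the session. A possible alternative, sketched here under the assumption that a temporary change is enough, is pd.option_context, which restores the previous value when the block ends. The 100-row frame is a hypothetical stand-in for the CSV data.

```python
import pandas as pd

df = pd.DataFrame({'value': range(100)})    # hypothetical 100-row frame

default_rows = pd.get_option('display.max_rows')

# Inside the block the limit is raised; it is restored on exit.
with pd.option_context('display.max_rows', 200):
    inside_rows = pd.get_option('display.max_rows')
    full_repr = repr(df)                    # rendered without truncation

restored_rows = pd.get_option('display.max_rows')
print(inside_rows, restored_rows == default_rows)
print(len(full_repr.splitlines()))
```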

In [26]:
# Simple analysis of the DataFrame:

# df.head(5), df.tail(4)
# df.info(), df.describe()

# uncomment whatever you want and see what you can understand about the data.
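
The inspection calls named in the cell above can be tried on a small frame. In this sketch the first five rows of the displayed data are typed in by hand so the block runs without data.csv; the name sample is made up for the example.

```python
import pandas as pd

# First five rows of the frame shown earlier, entered manually.
sample = pd.DataFrame({
    'Duration': [60, 60, 60, 45, 45],
    'Pulse': [110, 117, 103, 109, 117],
    'Maxpulse': [130, 145, 135, 175, 148],
    'Calories': [409.1, 479.0, 340.0, 282.4, 406.0],
})

print(sample.head(2))        # first two rows
print(sample.tail(2))        # last two rows
sample.info()                # column dtypes and non-null counts
summary = sample.describe()  # count, mean, std, min, quartiles, max per column
print(summary)
```

Note that info() prints its report directly and returns None, so it is called without print.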

#### Now The Real Stuff:
1. Cleaning The Data
2. 

In [27]:
# 1) Cleaning The Data: