# In this lesson...

In this lesson, we'll go through the following:
1. [Importing Libraries](#import-libraries)
2. [Working With Mock Data](#mock-data)
3. [Understanding Mock Data](#understand-mock-data)


<hr>

<br id="import-libraries"> 

# <span style='color:blue;'>Import libraries</span>

In general, it's good practice to keep all of your library imports at the top of your notebook or program.

In [1]:
# Data Manipulation Libraries
import numpy as np
import pandas as pd

# Data Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Library Configurations: 
sns.set() # make seaborn override the styling of matplotlib graphs
pd.set_option('display.max_columns', None) # Make Pandas display all columns
pd.set_option('display.max_rows', None) # Make Pandas display all rows
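
Setting `display.max_rows` to `None` is convenient for the five-row frame used in this lesson, but it prints every row of a large frame too. As a hedged alternative, the sketch below uses `pd.option_context` to change a display option only inside a `with` block; the limit of 10 rows is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
rows_before = pd.get_option('display.max_rows')

# Inside the block, at most 10 rows are displayed (10 is an arbitrary choice).
with pd.option_context('display.max_rows', 10):
    rows_inside = pd.get_option('display.max_rows')

print(rows_inside)                                       # 10
print(pd.get_option('display.max_rows') == rows_before)  # True
```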

<br id="mock-data">

<hr> 


# <span style='color:blue;'>Working With Mock Data</span>


## Helper functions To Create Mock Data

### Using Numpy's arange() Function To Generate Evenly Spaced Data:

**[numpy.arange()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html?highlight=arange#numpy.arange)** will return evenly spaced values within a given interval.

In [2]:
def generate_evenly_spaced_values_numpy_array(start, stop, step):
    """
        Info:
            This function returns a numpy array of evenly spaced values within a given interval
        Params:
            start (type: int)
            stop (type: int) 
            step (type: int)
        Output:
            numpy_array (type: numpy array)
    """
    numpy_array = np.arange(start, stop, step) 
    return numpy_array
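
To make the interval behaviour concrete, here is a small sketch that calls `np.arange` directly with the same arguments used later for the `favorite_number` column. The `np.linspace` call is an addition for comparison and is not used elsewhere in this lesson.

```python
import numpy as np

# The stop value is exclusive: counting from 1 in steps of 20 ends before 100.
evenly_spaced = np.arange(1, 100, 20)
print(evenly_spaced.tolist())  # [1, 21, 41, 61, 81]

# np.linspace takes the number of points instead of a step,
# and includes the stop value by default.
five_points = np.linspace(0.0, 1.0, num=5)
print(five_points.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```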

### Using Numpy's randint() Function To Generate Random Data:
**[numpy.random.randint()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randint.html)** will return random integers drawn from the “discrete uniform” distribution over the range from low (inclusive) to high (exclusive).


In [3]:
def generate_random_array(min_num, max_num, number_of_samples):
    """
        Info:
            This function will return a numpy array of random samples, given a range

        Params:
            min_num, inclusive (type: int)
            max_num, exclusive (type: int) 
            number_of_samples (type: int)
        Output:
            numpy_array (type: numpy array)
    """
    rand_arr = np.random.randint(min_num, max_num, size=number_of_samples)
    return rand_arr
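
Because the values are random, the mock data changes on every run. If repeatable output is wanted, one option is to seed NumPy's global generator before drawing. The sketch below assumes an arbitrary seed of 42 and reuses the age range from this lesson.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, chosen only for illustration
first_draw = np.random.randint(21, 70, size=5)

np.random.seed(42)  # re-seeding replays the same sequence of draws
second_draw = np.random.randint(21, 70, size=5)

print(np.array_equal(first_draw, second_draw))           # True
print(first_draw.min() >= 21 and first_draw.max() < 70)  # True
```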

### Using Numpy's random.normal() Function To Generate Random Data:
[numpy.random.normal()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.normal.html) will return random samples from a normal (Gaussian) distribution.

In [4]:
def generate_random_normal_array(mean, std, number_of_samples):
    """
        Info:
            This function will return random samples from a normal (Gaussian) distribution.

        Params:
            mean, (“centre”) of the distribution. (type: float)
            std, Standard deviation (spread or “width”) of the distribution. (type: float)
            number_of_samples (type: int)
        Output:
            numpy_array (type: numpy array)
    """
    rand_normal_arr = np.random.normal(mean, std, number_of_samples)
    return rand_normal_arr
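
With only five samples, the observed mean and spread can sit far from the requested values. As a rough check, the sketch below draws a much larger sample (10,000 values, an arbitrary size) with an arbitrary seed and compares the sample statistics with the requested mean of 150 and standard deviation of 20.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the check is repeatable
weights = np.random.normal(150, 20, 10_000)

# For a large sample, both statistics should land close to the requested values.
print(round(weights.mean(), 1), round(weights.std(), 1))
print(abs(weights.mean() - 150) < 1)  # True
print(abs(weights.std() - 20) < 1)    # True
```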

## Creating Mock Data Using Helper Functions

In [5]:
def get_sample_data():
    """
        Info:
            This function will return a pandas dataframe of five mock records built with the helper functions above
        Output:
            dataframe (type: pandas dataframe)
    """
    raw_data = {
        'first_name': ['Mike', 'Robert', 'Peter', 'Scott', 'Harold'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'unique_id': generate_random_array(1, 100, 5),
        'age': generate_random_array(21, 70, 5),
        'eye_color': ['green', 'brown', 'blue', 'hazel', np.nan],
        'weight': generate_random_normal_array(150, 20, 5),
        'favorite_number': generate_evenly_spaced_values_numpy_array(1, 100, 20)
    }
    columns = ['first_name', 'last_name', 'unique_id', 'age', 'eye_color', 'weight', 'favorite_number']
    return pd.DataFrame(raw_data, columns=columns)

## Helper Functions To Modify Mock Data

In [6]:
def rename_columns(dataframe, column_names):
    """
        Info:
            This function will rename all the columns of a pandas dataframe
        Params:
            dataframe (type: pandas dataframe)
            column_names: (type: List)
        Output:
            dataframe (type: pandas dataframe)
    """
    dataframe.columns = column_names
    return dataframe
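
Because `rename_columns` assigns to `dataframe.columns`, it changes the dataframe that was passed in and returns that same object. If the original needs to stay untouched, one possible variant renames a copy instead. The helper name `rename_columns_copy` below is hypothetical and not part of this lesson's code.

```python
import pandas as pd

def rename_columns_copy(dataframe, column_names):
    """Hypothetical variant: rename the columns of a copy, leaving the input unchanged."""
    renamed = dataframe.copy()
    renamed.columns = column_names
    return renamed

original = pd.DataFrame({'first_name': ['Mike'], 'last_name': ['Miller']})
renamed = rename_columns_copy(original, ['first', 'last'])

print(list(original.columns))  # ['first_name', 'last_name']
print(list(renamed.columns))   # ['first', 'last']
```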

In [7]:
def drop_record_by_value(dataframe, column_name, column_value):
    """
        Info:
            This function will drop records that contain a given value in a given column
        Params:
            dataframe: Target dataframe
            column_name: Column of focus
            column_value: Target value
        Output:
            dataframe (type: pandas dataframe)
    """
    new_df = dataframe[dataframe[column_name] != column_value]
    return new_df
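
Two details of this filter are easy to miss: rows with a missing value in the target column are kept, because `NaN != value` evaluates to `True`, and the surviving rows keep their original index labels. The sketch below shows both on a small, made-up frame.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'first_name': ['Scott', 'Harold', 'Mike'],
    'eye_color': ['hazel', np.nan, 'green'],
})

# NaN != 'hazel' is True, so the row with the missing eye color survives.
kept = people[people['eye_color'] != 'hazel']
print(kept['first_name'].tolist())  # ['Harold', 'Mike']
print(kept.index.tolist())          # [1, 2]

# reset_index(drop=True) renumbers the remaining rows from 0.
print(kept.reset_index(drop=True).index.tolist())  # [0, 1]
```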

## Renaming Columns Using Helper Function

In [8]:
# Get mock data
df = get_sample_data()
print("ORIGINAL DF:\n\n{}\n".format(df.head()))

# Set function params
new_column_names = ['first', 'last', 'id', 'age', 'eye_color', 'weight', 'fav_num']

# Rename the columns (rename_columns modifies df in place and returns it)
new_df = rename_columns(df, new_column_names)
print("Updated DF, with new column names:\n\n{}\n".format(new_df.head()))

ORIGINAL DF:

  first_name last_name  unique_id  age eye_color      weight  favorite_number
0       Mike    Miller         97   22     green  153.449920                1
1     Robert  Jacobson         15   66     brown  150.009764               21
2      Peter       Ali         84   26      blue  144.493227               41
3      Scott    Milner         73   47     hazel  134.402105               61
4     Harold     Cooze         75   34       NaN  151.611098               81

Updated DF, with new column names:

    first      last  id  age eye_color      weight  fav_num
0    Mike    Miller  97   22     green  153.449920        1
1  Robert  Jacobson  15   66     brown  150.009764       21
2   Peter       Ali  84   26      blue  144.493227       41
3   Scott    Milner  73   47     hazel  134.402105       61
4  Harold     Cooze  75   34       NaN  151.611098       81



## Dropping A Row Based On A Column Value Using Helper Function

In [9]:
# Get mock data
df = get_sample_data()
print("ORIGINAL DF:\n\n{}\n".format(df.head()))

# Set function params
column_name = 'eye_color'
column_value = 'hazel'

# Create new df with removed records
new_df = drop_record_by_value(df, column_name, column_value)

print("Updated DF, with dropped row based on specific column value:\n\n{}\n".format(new_df.head()))

ORIGINAL DF:

  first_name last_name  unique_id  age eye_color      weight  favorite_number
0       Mike    Miller         79   45     green  158.979554                1
1     Robert  Jacobson         46   44     brown  166.308877               21
2      Peter       Ali         30   39      blue  135.216292               41
3      Scott    Milner         95   48     hazel  117.544303               61
4     Harold     Cooze         66   33       NaN  147.616301               81

Updated DF, with dropped row based on specific column value:

  first_name last_name  unique_id  age eye_color      weight  favorite_number
0       Mike    Miller         79   45     green  158.979554                1
1     Robert  Jacobson         46   44     brown  166.308877               21
2      Peter       Ali         30   39      blue  135.216292               41
4     Harold     Cooze         66   33       NaN  147.616301               81



<br id="understand-mock-data"> 

<hr>

# <span style='color:blue;'>Understanding The Mock Dataset</span>

## Visualizing the head and tail 

In [10]:
df.head(5)

| | first_name | last_name | unique_id | age | eye_color | weight | favorite_number |
|---|---|---|---|---|---|---|---|
| 0 | Mike | Miller | 79 | 45 | green | 158.979554 | 1 |
| 1 | Robert | Jacobson | 46 | 44 | brown | 166.308877 | 21 |
| 2 | Peter | Ali | 30 | 39 | blue | 135.216292 | 41 |
| 3 | Scott | Milner | 95 | 48 | hazel | 117.544303 | 61 |
| 4 | Harold | Cooze | 66 | 33 | NaN | 147.616301 | 81 |


In [11]:
df.tail(5)

| | first_name | last_name | unique_id | age | eye_color | weight | favorite_number |
|---|---|---|---|---|---|---|---|
| 0 | Mike | Miller | 79 | 45 | green | 158.979554 | 1 |
| 1 | Robert | Jacobson | 46 | 44 | brown | 166.308877 | 21 |
| 2 | Peter | Ali | 30 | 39 | blue | 135.216292 | 41 |
| 3 | Scott | Milner | 95 | 48 | hazel | 117.544303 | 61 |
| 4 | Harold | Cooze | 66 | 33 | NaN | 147.616301 | 81 |


## Checking the shape of the data frame

In [12]:
df.shape

(5, 7)

## Checking the data types of the columns
 

In [13]:
df.dtypes

first_name          object
last_name           object
unique_id            int64
age                  int64
eye_color           object
weight             float64
favorite_number      int64
dtype: object
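
Once the column types are known, they can be used to split a frame into numeric and non-numeric columns. The sketch below uses `select_dtypes` on a small, made-up frame; `df_demo` is an illustrative name, not a variable from this lesson.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'first_name': ['Mike', 'Robert'],
    'age': [45, 44],
    'weight': [158.9, 166.3],
})

# 'number' matches every integer and float dtype.
numeric_columns = df_demo.select_dtypes(include='number').columns.tolist()
other_columns = df_demo.select_dtypes(exclude='number').columns.tolist()

print(numeric_columns)  # ['age', 'weight']
print(other_columns)    # ['first_name']
```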

### More info

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
first_name         5 non-null object
last_name          5 non-null object
unique_id          5 non-null int64
age                5 non-null int64
eye_color          4 non-null object
weight             5 non-null float64
favorite_number    5 non-null int64
dtypes: float64(1), int64(3), object(3)
memory usage: 360.0+ bytes


## Summary statistics for numerical dtypes

In [15]:
df.describe()

| | unique_id | age | weight | favorite_number |
|---|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 | 5.0 |
| mean | 63.2 | 41.8 | 145.133065 | 41.0 |
| std | 25.820534 | 5.890671 | 19.402248 | 31.622777 |
| min | 30.0 | 33.0 | 117.544303 | 1.0 |
| 25% | 46.0 | 39.0 | 135.216292 | 21.0 |
| 50% | 66.0 | 44.0 | 147.616301 | 41.0 |
| 75% | 79.0 | 45.0 | 158.979554 | 61.0 |
| max | 95.0 | 48.0 | 166.308877 | 81.0 |
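
The `std` row reports the sample standard deviation (pandas divides by n - 1), which differs from NumPy's default population standard deviation (divides by n). The sketch below recomputes both for the `favorite_number` values from this lesson.

```python
import numpy as np
import pandas as pd

favorite_number = pd.Series([1, 21, 41, 61, 81])

sample_std = favorite_number.std()                   # pandas default: ddof=1
population_std = np.std(favorite_number.to_numpy())  # NumPy default: ddof=0

print(round(sample_std, 6))      # 31.622777
print(round(population_std, 6))  # 28.284271

# describe() reports the same sample standard deviation.
print(np.isclose(favorite_number.describe()['std'], sample_std))  # True
```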


## Summary statistics for categorical (object) dtypes

In [16]:
df.describe(include='object')

| | first_name | last_name | eye_color |
|---|---|---|---|
| count | 5 | 5 | 4 |
| unique | 5 | 5 | 4 |
| top | Harold | Miller | green |
| freq | 1 | 1 | 1 |


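## Exploring Column Value Counts

Note: `display()` renders each result with the notebook's rich formatting. Jupyter provides it automatically; in a plain Python script, import it with `from IPython.display import display` or use `print()` instead.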

In [17]:
for column_name in df:
    print(column_name)
    display(df[column_name].value_counts(dropna=False).sort_index(ascending=False))
    print()

first_name


Scott     1
Robert    1
Peter     1
Mike      1
Harold    1
Name: first_name, dtype: int64


last_name


Milner      1
Miller      1
Jacobson    1
Cooze       1
Ali         1
Name: last_name, dtype: int64


unique_id


95    1
79    1
66    1
46    1
30    1
Name: unique_id, dtype: int64


age


48    1
45    1
44    1
39    1
33    1
Name: age, dtype: int64


eye_color


hazel    1
green    1
brown    1
blue     1
NaN      1
Name: eye_color, dtype: int64


weight


166.308877    1
158.979554    1
147.616301    1
135.216292    1
117.544303    1
Name: weight, dtype: int64


favorite_number


81    1
61    1
41    1
21    1
1     1
Name: favorite_number, dtype: int64
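
Every value in the mock data occurs once, so the counts above are all 1. The sketch below uses a small, made-up series with a repeated value to show what `dropna=False` changes: by default `value_counts()` leaves missing values out, while `dropna=False` counts them as their own entry.

```python
import numpy as np
import pandas as pd

eye_color = pd.Series(['green', 'brown', 'green', np.nan], name='eye_color')

# Default behaviour: missing values are excluded, most frequent value first.
counts = eye_color.value_counts()
print(counts.to_dict())  # {'green': 2, 'brown': 1}

# With dropna=False the missing value gets its own row, so the counts cover all 4 entries.
counts_with_nan = eye_color.value_counts(dropna=False)
print(len(counts_with_nan))        # 3
print(int(counts_with_nan.sum()))  # 4
```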




## Showing Unique Values

In [18]:
for column_name in df:
    print("Column name:", column_name )
    print('Number of unique values:', df[column_name].nunique())
    print(df[column_name].unique())
    print()

Column name: first_name
Number of unique values: 5
['Mike' 'Robert' 'Peter' 'Scott' 'Harold']

Column name: last_name
Number of unique values: 5
['Miller' 'Jacobson' 'Ali' 'Milner' 'Cooze']

Column name: unique_id
Number of unique values: 5
[79 46 30 95 66]

Column name: age
Number of unique values: 5
[45 44 39 48 33]

Column name: eye_color
Number of unique values: 4
['green' 'brown' 'blue' 'hazel' nan]

Column name: weight
Number of unique values: 5
[158.97955412 166.3088775  135.21629182 117.54430327 147.61630054]

Column name: favorite_number
Number of unique values: 5
[ 1 21 41 61 81]
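
In the output above, `eye_color` reports 4 unique values while `unique()` lists 5 entries. The difference is that `nunique()` ignores missing values by default, whereas `unique()` includes `nan`. A short sketch using the same eye colors:

```python
import numpy as np
import pandas as pd

eye_color = pd.Series(['green', 'brown', 'blue', 'hazel', np.nan])

print(eye_color.nunique())              # 4  (NaN excluded by default)
print(eye_color.nunique(dropna=False))  # 5  (NaN counted as a value)
print(len(eye_color.unique()))          # 5  (unique() keeps NaN)
```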



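## Checking how many null values each column has
Note: `isnull()` returns a data frame of booleans with the same shape as `df`, and `sum()` then collapses each column into a single count. Reductions like this typically return a result one dimension lower than their input, so the output here is a Series.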

In [19]:
df.isnull().sum()

first_name         0
last_name          0
unique_id          0
age                0
eye_color          1
weight             0
favorite_number    0
dtype: int64
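
The drop in dimension can be followed step by step. The sketch below uses a small, made-up frame (`df_demo` is an illustrative name) and shows the data frame of booleans, the per-column Series, and the single total obtained by summing once more.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'eye_color': ['green', np.nan], 'age': [45, 33]})

mask = df_demo.isnull()   # DataFrame of booleans, same shape as df_demo
per_column = mask.sum()   # Series: one count per column
total = per_column.sum()  # scalar: total number of missing values

print(mask.ndim, per_column.ndim)  # 2 1
print(per_column.to_dict())        # {'eye_color': 1, 'age': 0}
print(int(total))                  # 1
```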