# In this lesson...

In this lesson, we'll go through the following:
1. [Importing Libraries](#import-libraries)
2. [Working With Mock Data](#mock-data)
3. [Understanding Mock Data](#understand-mock-data)


<hr>

<br id="import-libraries"> 

# <span style='color:blue;'>Import libraries</span>

In general, it's good practice to keep all of your library imports at the top of your notebook or program.

In [0]:
# Array and DataFrame Libraries
import numpy as np
import pandas as pd

# Data Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Library Configurations: 
sns.set() # make seaborn override the styling of matplotlib graphs
pd.set_option('display.max_columns', None) # Make Pandas display all columns
pd.set_option('display.max_rows', None) # Make Pandas display all rows
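
The `pd.set_option` calls above change the display settings for the whole session. As a minimal sketch of an alternative, `pd.option_context` applies a setting only inside a `with` block; the value `5` below is an arbitrary choice for illustration.

```python
import pandas as pd

# Remember the current setting so we can compare afterwards.
rows_before = pd.get_option('display.max_rows')

# Inside the block, the option is temporarily overridden.
with pd.option_context('display.max_rows', 5):
    rows_inside = pd.get_option('display.max_rows')

# After the block, the previous value is restored automatically.
rows_after = pd.get_option('display.max_rows')

print(rows_inside)                # 5
print(rows_after == rows_before)  # True
```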

<br id="mock-data">

<hr> 


# <span style='color:blue;'>Working With Mock Data </span>


## Helper Functions To Create Mock Data

### Using Numpy's arange() Function To Generate Evenly Spaced Data:

**[numpy.arange()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html?highlight=arange#numpy.arange)** returns evenly spaced values within the half-open interval `[start, stop)`, so `stop` itself is never included.

In [0]:
def generate_evenly_spaced_values_numpy_array(start, stop, step):
    """
        Info:
            This function returns a numpy array of evenly spaced values within a given interval
        Params:
            start, inclusive (type: int)
            stop, exclusive (type: int)
            step (type: int)
        Output:
            numpy_array (type: numpy array)
    """
    numpy_array = np.arange(start, stop, step) 
    return numpy_array
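
To make the half-open interval concrete, here is a small sketch that calls `np.arange` directly. The first call uses the same arguments that the mock data later passes for `favorite_number`; the second call is an extra illustration.

```python
import numpy as np

# start is included, stop is excluded, values advance by step.
values = np.arange(1, 100, 20)
print(values)  # [ 1 21 41 61 81]

# stop never appears, even when it lies exactly on the step grid.
grid = np.arange(0, 10, 5)
print(grid)    # [0 5]
```

Note that the result is deterministic: calling `np.arange` with the same arguments always gives the same array.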

### Using Numpy's randint() Function To Generate Random Data:
**[numpy.random.randint()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randint.html)** returns random integers drawn from the “discrete uniform” distribution, from low (inclusive) to high (exclusive).


In [0]:
def generate_random_array(min_num, max_num, number_of_samples):
    """
        Info:
            This function will return a numpy array of random samples, given a range

        Params:
            min_num, inclusive (type: int)
            max_num, exclusive (type: int) 
            number_of_samples (type: int)
        Output:
            rand_arr (type: numpy array)
    """
    rand_arr = np.random.randint(min_num, max_num, size=number_of_samples)
    return rand_arr
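
Because the values are random, each run of the notebook produces different data. The sketch below shows one way to make the draws repeatable with `np.random.seed`; the seed value `42` is an arbitrary choice for illustration and is not part of the original lesson.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, chosen only for this example
first = np.random.randint(21, 70, size=5)

np.random.seed(42)  # re-seeding replays the same sequence
second = np.random.randint(21, 70, size=5)

print(np.array_equal(first, second))           # True
print(first.min() >= 21 and first.max() < 70)  # True
```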

### Using Numpy's random.normal() Function To Generate Random Data:
**[numpy.random.normal()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.normal.html)** will return random samples from a normal (Gaussian) distribution.

In [0]:
def generate_random_normal_array(mean, std, number_of_samples):
    """
        Info:
            This function will return random samples from a normal (Gaussian) distribution.

        Params:
            mean, (“centre”) of the distribution. (type: float)
            std, Standard deviation (spread or “width”) of the distribution. (type: float)
            number_of_samples (type: int)
        Output:
            rand_normal_arr (type: numpy array)
    """
    rand_normal_arr = np.random.normal(mean, std, number_of_samples)
    return rand_normal_arr
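
With only a handful of samples, the sample mean and standard deviation can sit some distance from the requested parameters. The following sketch compares a small draw with a large one; the seed and the sample sizes are assumptions made for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the comparison is repeatable

small = np.random.normal(150, 20, 5)        # same size as the mock data
large = np.random.normal(150, 20, 100_000)  # large enough to approach the parameters

print("small:", small.mean(), small.std())
print("large:", large.mean(), large.std())
```

The large sample should land close to a mean of 150 and a standard deviation of 20, while the five-value sample may not.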

## Creating Mock Data Using Helper Functions

In [0]:
def get_sample_data():
    raw_data = {
        'first_name': ['Mike', 'Robert', 'Peter', 'Scott', 'Harold'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'unique_id': generate_random_array(1, 100, 5),
        'age': generate_random_array(21, 70, 5),
        'eye_color': ['green', 'brown', 'blue', 'hazel', np.nan],
        'weight': generate_random_normal_array(150, 20, 5),
        'favorite_number': generate_evenly_spaced_values_numpy_array(1, 100, 20),
    }
    column_order = ['first_name', 'last_name', 'unique_id', 'age', 'eye_color', 'weight', 'favorite_number']
    return pd.DataFrame(raw_data, columns=column_order)
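
Every column passed to `pd.DataFrame` must have the same length, which is why each helper call above is arranged to return exactly five values. The sketch below illustrates what happens otherwise; the step of `25` is a deliberately wrong value chosen for the example.

```python
import numpy as np
import pandas as pd

# A step of 20 over [1, 100) yields five values, matching the five names.
favorite_number = np.arange(1, 100, 20)
print(len(favorite_number))  # 5

mismatch_error = None
try:
    # A step of 25 yields only four values, so the columns no longer line up.
    pd.DataFrame({
        'first_name': ['Mike', 'Robert', 'Peter', 'Scott', 'Harold'],
        'favorite_number': np.arange(1, 100, 25),
    })
except ValueError as error:
    mismatch_error = error
    print("Could not build the frame:", error)
```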

## Helper Functions To Modify Mock Data

In [0]:
def rename_columns(dataframe, column_names):
    """
        Info:
            This function will rename all the columns of a pandas dataframe in place
        Params:
            dataframe (type: pandas dataframe)
            column_names: one new name per existing column, in order (type: List)
        Output:
            dataframe, the same object that was passed in (type: pandas dataframe)
    """
    dataframe.columns = column_names
    return dataframe
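
Assigning to `.columns`, as `rename_columns` does, changes the dataframe that was passed in. As a sketch of an alternative, `DataFrame.rename` with a mapping returns a new dataframe and leaves the original untouched. The `people` frame below is a hypothetical one-row example.

```python
import pandas as pd

people = pd.DataFrame({'first_name': ['Mike'], 'last_name': ['Miller']})

# rename() with a mapping returns a new frame; the original keeps its names.
renamed = people.rename(columns={'first_name': 'first', 'last_name': 'last'})
print(list(people.columns))   # ['first_name', 'last_name']
print(list(renamed.columns))  # ['first', 'last']

# Assigning to .columns modifies the frame in place.
people.columns = ['first', 'last']
print(list(people.columns))   # ['first', 'last']
```

The mapping form also lets you rename a subset of columns without listing every name.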

In [0]:
def drop_record_by_value(dataframe, column_name, column_value):
    """
        Info:
            This function will drop records that contain a given value in a given column
        Params:
            dataframe: Target dataframe
            column_name: Column of focus
            column_value: Target value
        Output:
            dataframe (type: pandas dataframe)
    """
    new_df = dataframe[dataframe[column_name] != column_value]
    return new_df
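
One detail worth knowing about the `!=` filter: a missing value (`NaN`) compares as not equal to everything, so rows with a missing value in the target column are kept. The sketch below shows this on a hypothetical three-row frame.

```python
import numpy as np
import pandas as pd

eyes = pd.DataFrame({'eye_color': ['green', 'hazel', np.nan]})

# Same filter as drop_record_by_value: keep rows whose value differs from 'hazel'.
kept = eyes[eyes['eye_color'] != 'hazel']

print(len(kept))                         # 2
print(kept['eye_color'].isnull().sum())  # 1
print(list(kept.index))                  # [0, 2]
```

The original index labels are preserved, so the result has a gap where the dropped row was. Calling `reset_index(drop=True)` on the result renumbers the rows if that matters.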

## Renaming Columns Using Helper Function

In [0]:
# Get mock data
df = get_sample_data()
print("ORIGINAL DF:\n\n{}\n".format(df.head()))

# Set function params
new_column_names = ['first', 'last', 'id', 'age', 'eye_color', 'weight', 'fav_num']

# Rename the columns (rename_columns modifies df in place and returns it)
new_df = rename_columns(df, new_column_names)
print("Updated DF, with new column names:\n\n{}\n".format(new_df.head()))

## Dropping Rows Based On A Column Value Using Helper Function

In [0]:
# Get mock data
df = get_sample_data()
print("ORIGINAL DF:\n\n{}\n".format(df.head()))

# Set function params
column_name = 'eye_color'
column_value = 'hazel'

# Create new df with removed records
new_df = drop_record_by_value(df, column_name, column_value)

print("Updated DF, with dropped row based on specific column value:\n\n{}\n".format(new_df.head()))

<br id="understand-mock-data"> 

<hr>

# <span style='color:blue;'>Understanding The Mock Dataset</span>

## Visualizing the head and tail 

In [0]:
df.head(5)

In [0]:
df.tail(5)

## Checking the shape of the data frame

In [0]:
df.shape

## Checking the data types of the columns
 

In [0]:
df.dtypes

### More info

In [0]:
df.info()

## Summary statistics for numerical dtypes

In [0]:
df.describe()

## Summary statistics for categorical (object) dtypes

In [0]:
df.describe(include='object')
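
The two `describe()` calls report different statistics depending on the column type. The sketch below lists the row labels each one produces, using a hypothetical two-column frame. The string column is created with `dtype=object` explicitly so that `include='object'` selects it.

```python
import pandas as pd

sample = pd.DataFrame({
    'age': [21, 35, 50],
    'eye_color': pd.Series(['green', 'brown', 'green'], dtype=object),
})

# Numerical columns: count, mean, std, and the five-number summary.
numeric_summary = sample.describe()
print(list(numeric_summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Object columns: count, number of distinct values, most common value, its frequency.
object_summary = sample.describe(include='object')
print(list(object_summary.index))
# ['count', 'unique', 'top', 'freq']
```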

## Exploring Column Value Counts

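Note: `display` is available automatically in Jupyter/IPython and renders each result in the notebook's formatted output style. In a plain Python script, import it with `from IPython.display import display`, or use `print` instead.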

In [0]:
for column_name in df:
    print(column_name)
    display(df[column_name].value_counts(dropna=False).sort_index(ascending=False))
    print()
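
The `dropna=False` argument matters for columns such as `eye_color` that contain a missing value. A minimal sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

eye_color = pd.Series(['green', 'brown', 'green', np.nan])

# By default, missing values are left out of the counts.
print(eye_color.value_counts().sum())              # 3

# With dropna=False, NaN gets its own row in the result.
print(eye_color.value_counts(dropna=False).sum())  # 4
```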

## Showing Unique Values

In [0]:
for column_name in df:
    print("Column name:", column_name)
    print('Number of unique values:', df[column_name].nunique())
    print(df[column_name].unique())
    print()
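
The two calls in the loop treat missing values differently, so their results can disagree by one for a column that contains `NaN`. A short sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

eye_color = pd.Series(['green', 'brown', 'green', np.nan])

# nunique() ignores NaN by default.
print(eye_color.nunique())              # 2

# unique() includes NaN in the returned array.
print(len(eye_color.unique()))          # 3

# Passing dropna=False makes nunique() count NaN as well.
print(eye_color.nunique(dropna=False))  # 3
```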

## Checking how many null values each column has
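Note: An aggregation such as `sum()` returns a result with one dimension fewer than its input. Here `isnull()` gives a DataFrame of booleans, and `sum()` collapses it into a Series holding one null count per column.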

In [0]:
df.isnull().sum()
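
The step-by-step reduction can be sketched on a hypothetical two-row frame: each `sum()` removes one dimension, going from DataFrame to Series to a single number.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'eye_color': ['green', np.nan], 'age': [30, 40]})

mask = frame.isnull()     # 2-D: DataFrame of booleans
per_column = mask.sum()   # 1-D: Series with one null count per column
total = per_column.sum()  # 0-D: total number of nulls in the frame

print(mask.ndim, per_column.ndim)  # 2 1
print(per_column['eye_color'])     # 1
print(int(total))                  # 1
```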