## In-Class Notebook 

### Numpy Review

Data scientists primarily deal with structured numeric data. While tuples, lists and dictionaries are useful for general programming, *vectors* and *arrays* are more useful for mathematical calculations.

[NumPy](https://docs.scipy.org/doc/numpy-1.13.0/index.html) is an *extension module* to the Python language that provides vectors and arrays. NumPy has been imported with alias `np` in the cell below. We will now go through some basic numpy operations. 


In [2]:
import numpy as np


#### 1. Checking the numpy version you have installed on your system

In [3]:
# Type out python command for displaying version
np.__version__

'1.16.5'

#### 2. Create a 10x10 matrix whose border elements are equal to 1 and whose interior elements are 0

In [None]:
# Create your 10x10 matrix and store it in x (HINT: Make it all ones or all zeros to begin with)
x = 0

# Slice indexing to modify the matrix

print(x)
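
One possible approach, given here as a minimal sketch rather than the only solution: start from a matrix of ones and use slice indexing to zero out the interior.

```python
import numpy as np

# Start with a 10x10 matrix filled with ones
x = np.ones((10, 10))

# Rows 1..8 and columns 1..8 form the interior; set them to zero
x[1:-1, 1:-1] = 0

print(x)
```

Starting from zeros and setting the four edges to 1 would work equally well.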

#### 3. Compute the product of two given matrices

In [None]:

p = [[1, 0], [0, 1]]
q = [[1, 2], [3, 4]]
print("Original matrices:")
print(p)
print(q)

# Enter the solution to do matrix multiplication

# Return shape of resultant matrix

print("Result of the said matrix multiplication:")

# print result
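
A minimal sketch of one way to fill in this cell, assuming `np.matmul` is the intended tool (`np.dot` or the `@` operator on arrays would give the same result for 2-D inputs). The name `result` is chosen here for illustration.

```python
import numpy as np

p = [[1, 0], [0, 1]]
q = [[1, 2], [3, 4]]

# np.matmul accepts nested lists and returns a NumPy array
result = np.matmul(p, q)

print(result.shape)
print("Result of the said matrix multiplication:")
print(result)
```

Because `p` is the identity matrix, the product equals `q`.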


#### 4. Generate random numbers from a normal distribution with mean 2 and standard deviation 1.5

In [None]:
mu = 2
s_dev = 1.5

# Enter numpy command to generate the random numbers

# Check shape of resulting variable 
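
A minimal sketch using `np.random.normal`, which is available in the NumPy version shown above. The sample size of 1000, the seed, and the name `samples` are assumptions made for illustration.

```python
import numpy as np

mu = 2
s_dev = 1.5

np.random.seed(0)  # fixed seed so the run is repeatable (illustrative choice)

# Draw 1000 samples from a normal distribution with the given mean and standard deviation
samples = np.random.normal(loc=mu, scale=s_dev, size=1000)

print(samples.shape)
print(samples.mean(), samples.std())
```

The sample mean and standard deviation should land close to `mu` and `s_dev`, though not exactly on them.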


#### 5. Shuffling Arrays: Shuffle the integers from 0 to 9

In [None]:

x = np.arange(10) # Creating a vector of 10 elements

# Randomly shuffle the elements of the vector

# Print the result in a nice format
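
A minimal sketch, assuming an in-place shuffle is acceptable. `np.random.permutation(x)` would return a shuffled copy instead of modifying `x`.

```python
import numpy as np

x = np.arange(10)  # the integers 0..9

# Shuffle the vector in place
np.random.shuffle(x)

# Print the elements as a comma-separated list
print("Shuffled:", ", ".join(str(v) for v in x))
```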


## Pandas Dataframe Review

Pandas is part of an ecosystem of Python software used for statistical analysis.

Pandas extends Python with two datatypes used in statistical analysis: the Series and the DataFrame.

The name "Pandas" is derived from "panel data", a term for multi-dimensional structured data sets. In Pandas, such data is represented by the DataFrame.

As with NumPy, we need to import Pandas. We'll see almost all of our notebooks starting with:


In [None]:
import pandas as pd

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

There are a number of ways you can construct a DataFrame. One of the most common is to use a Python dictionary whose keys label the different columns.

In [None]:
ages = np.array([20, 39, 45, 18, 56, 90])
salary = np.array([10000, 40000, 50000, 8000, 55000, 5000])

In [None]:

# Using a Python dictionary, create a pandas dataframe and store it in df

df
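
A minimal sketch of the dictionary approach. The column names `"age"` and `"salary"` are assumptions chosen to match the array names.

```python
import numpy as np
import pandas as pd

ages = np.array([20, 39, 45, 18, 56, 90])
salary = np.array([10000, 40000, 50000, 8000, 55000, 5000])

# Each dictionary key becomes a column name; each value becomes that column's data
df = pd.DataFrame({"age": ages, "salary": salary})

print(df)
```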

You can also specify the index column (_e.g._ it could be the names of the people represented in the age/salary data)

In [None]:
# Name your specific rows with an index column and store the result in df2

df2
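
A minimal sketch using the `index` argument of `pd.DataFrame`. The names in `names` are hypothetical; the original data does not say who these people are.

```python
import numpy as np
import pandas as pd

ages = np.array([20, 39, 45, 18, 56, 90])
salary = np.array([10000, 40000, 50000, 8000, 55000, 5000])

# Hypothetical row labels, one per person
names = ["Ann", "Bob", "Cat", "Dan", "Eve", "Fay"]

df2 = pd.DataFrame({"age": ages, "salary": salary}, index=names)

print(df2)
```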

#### 1. Accessing Rows of a Data Frame

You can index a dataframe by row index to extract a set of rows. For integer indices, the range is specified as **from:to** and includes entries from **from** up to, but not including, **to**.

In [None]:
# Select the sub-dataframe with rows 1 and 2

You can do similar slices for named rows, but note that the range now includes all of the specified rows (i.e. it doesn't end before the last index).

In [None]:
# Select rows corresponding to the indices you used above to see the difference
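
The two slicing behaviours can be compared side by side. This is a sketch with hypothetical row labels; `by_position` and `by_label` are names chosen for illustration.

```python
import numpy as np
import pandas as pd

ages = np.array([20, 39, 45, 18, 56, 90])
salary = np.array([10000, 40000, 50000, 8000, 55000, 5000])
names = ["Ann", "Bob", "Cat", "Dan", "Eve", "Fay"]  # hypothetical labels

df = pd.DataFrame({"age": ages, "salary": salary})
df2 = pd.DataFrame({"age": ages, "salary": salary}, index=names)

# Integer slice: rows 1 and 2 only, the end position 3 is excluded
by_position = df[1:3]

# Label slice: "Bob" is at position 1 and "Dan" at position 3, and both ends are included
by_label = df2["Bob":"Dan"]

print(by_position)
print(by_label)
```

The integer slice returns two rows, while the label slice over the same positions returns three.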

We can also access a single row of the data frame with the `loc` indexer, which selects rows by index label (its counterpart `iloc` selects by integer position).

In [None]:
# Use the `loc` operator to select a particular row index
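
A minimal sketch of single-row access, again with hypothetical row labels. `loc` looks a row up by its label, and `iloc` is shown for comparison.

```python
import numpy as np
import pandas as pd

ages = np.array([20, 39, 45, 18, 56, 90])
salary = np.array([10000, 40000, 50000, 8000, 55000, 5000])
names = ["Ann", "Bob", "Cat", "Dan", "Eve", "Fay"]  # hypothetical labels

df2 = pd.DataFrame({"age": ages, "salary": salary}, index=names)

row_by_label = df2.loc["Cat"]    # select the row labelled "Cat"
row_by_position = df2.iloc[2]    # select the third row by position (same row here)

print(row_by_label)
```

A single row comes back as a Series whose index holds the column names.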

#### 2. Accessing Elements and Columns of a DataFrame

You can refer to each column using the name of the column. 

In [None]:
# Select age column

In [None]:
# You can also use this form..


Once you've selected a column, you can access elements using the index for that specific row.

In [None]:
# Select the age (or any other attribute) for a particular row

In [None]:
# Alternate way using the `loc` operator
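
The column and element accesses described in this section could look like the sketch below. The column names are assumptions, and the default integer index is used.

```python
import numpy as np
import pandas as pd

ages = np.array([20, 39, 45, 18, 56, 90])
salary = np.array([10000, 40000, 50000, 8000, 55000, 5000])
df = pd.DataFrame({"age": ages, "salary": salary})

age_col = df["age"]      # select a column by name
age_col_attr = df.age    # attribute form; works when the name is a valid Python identifier

age_row_3 = df["age"][3]         # pick the column first, then the row label
age_row_3_loc = df.loc[3, "age"]  # row label and column name in one step

print(age_row_3, age_row_3_loc)
```

The `loc` form does the lookup in a single step, which is also the safer choice when assigning values.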

#### 3. Adding new columns

We can add new columns to the data frame simply by assigning to them.

In [None]:
# Create a new column where each element is the product of the corresponding age and salary
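
A minimal sketch of column assignment. The new column name `age_x_salary` is hypothetical.

```python
import numpy as np
import pandas as pd

ages = np.array([20, 39, 45, 18, 56, 90])
salary = np.array([10000, 40000, 50000, 8000, 55000, 5000])
df = pd.DataFrame({"age": ages, "salary": salary})

# Multiplying two columns works element by element; assigning creates the new column
df["age_x_salary"] = df["age"] * df["salary"]

print(df)
```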

## Exploratory Data Analysis with a data set

Go to https://archive.ics.uci.edu/ml/datasets/Breast+Cancer and download the file. You can change the data file format to csv.

In [None]:
# read the csv file to a data frame (there is literally a function called read_csv)


In [None]:
df.head() # Shows the top few rows of the dataframe that you just imported 

In [None]:
df.info()  # Check the data types; they are not always in the desired form, and you can change them

### Data Cleaning Ideas
1. Change column names 
2. Find Null values and clean them
3. Check the data types
4. Convert ordinal category strings to numbers
5. Convert non-ordinal category strings to **dummified** (one-hot) columns

In [None]:
# Let us check what columns there are

In [None]:
# create a dictionary and use it to rename the columns
columns = ['class','age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']

# Create a list out of the dataframe column names

# Pair it with elements of the `columns` list using the zip() function

# Create a dictionary out of the zipped pairs and store it in dd

print(dd)
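
The zip-and-rename pattern can be sketched on a small stand-in frame. Everything about `toy` is an assumption: it has three unnamed columns and made-up rows, and only the first three names from `columns` are used.

```python
import pandas as pd

# Toy stand-in for the imported data (hypothetical rows, default integer column names)
toy = pd.DataFrame([["no-recurrence-events", "30-39", "premeno"],
                    ["recurrence-events", "40-49", "ge40"]])

columns = ["class", "age", "menopause"]

old_names = list(toy.columns)   # current column names
pairs = zip(old_names, columns)  # pair each old name with its new name
dd = dict(pairs)                 # mapping of old name -> new name

print(dd)

toy = toy.rename(columns=dd)
print(toy.head())
```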

In [None]:
df.rename(columns=dd, inplace=True)

In [None]:
df.head()

In [None]:
# We want to change "class" to be 0 or 1.
# You can change values using .apply(): 0 = no-recurrence-events, 1 = recurrence-events; check out the lambda function to achieve this with minimal code.

# Make sure that there are no other values before you use if-else. We know this from the unique values we've inspected above.
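
A minimal sketch of the `.apply()` with lambda approach on a toy column. The rows are made up, and the two class strings are taken from the comment above.

```python
import pandas as pd

# Toy stand-in for the 'class' column (hypothetical rows)
toy = pd.DataFrame({"class": ["no-recurrence-events",
                              "recurrence-events",
                              "no-recurrence-events"]})

# Inspect the distinct values first, so the if-else below covers every case
print(toy["class"].unique())

# 1 for recurrence-events, 0 for everything else (here: no-recurrence-events)
toy["class"] = toy["class"].apply(lambda v: 1 if v == "recurrence-events" else 0)

print(toy)
```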

In [None]:
df['class'].unique()

In [None]:
df.head()

Age ranges are inconvenient to handle and it would be more convenient if this were a numeric column. Let us make each element the average of the range to achieve this.

In [None]:
# List with average of each age range
ageval=[24,35,45,55,65,75]

In [None]:
# Similar to what we did with the column names, let us make a dictionary to link the age range to its corresponding
# age average



In [None]:
# View the dictionary that you created

In [None]:
# Replace the age column `df['age']` with its numeric counterparts using apply and lambda
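
The dictionary lookup could be sketched as follows. The range labels in `age_ranges` and the rows in `toy` are assumptions about how the ages appear in the file; `ageval` is the list given above.

```python
import pandas as pd

# Toy stand-in for the 'age' column (hypothetical rows)
toy = pd.DataFrame({"age": ["30-39", "20-29", "50-59", "30-39"]})

# Assumed range labels, in the same order as the averages
age_ranges = ["20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]
ageval = [24, 35, 45, 55, 65, 75]

# Link each age range to its numeric value
age_map = dict(zip(age_ranges, ageval))
print(age_map)

# Replace each range string with its number
toy["age"] = toy["age"].apply(lambda r: age_map[r])
print(toy)
```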

In [None]:
df.head()

In [None]:
# Use get_dummies on the menopause column, since its categories have no natural numeric order
# get_dummies creates a new dataframe with one column per distinct value in the original column; the elements of the new
# columns will be binary

# Concatenate the newly created dataframe to your existing dataframe
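
A minimal sketch of `pd.get_dummies` followed by `pd.concat`. The toy rows and the name `toy_with_dummies` are hypothetical, and the menopause categories shown are assumed for illustration.

```python
import pandas as pd

# Toy stand-in with a categorical 'menopause' column (hypothetical rows)
toy = pd.DataFrame({"age": [35, 45, 55],
                    "menopause": ["premeno", "ge40", "lt40"]})

# One new column per distinct category, filled with binary indicators
dummies = pd.get_dummies(toy["menopause"])

# axis=1 places the new columns side by side with the existing ones
toy_with_dummies = pd.concat([toy, dummies], axis=1)

print(toy_with_dummies)
```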

In [None]:
# Here are some tricks to extract the average of the tumor size range.

# We'll create an ordinal variable, using the average of each range as representative of the
# category (but again, you can assign whatever makes sense to you)
tumors = sorted(df['tumor-size'].unique())
# Split each 'low-high' string on the hyphen and average the two numbers
tsize = [(int(parts[0]) + int(parts[1])) / 2 for parts in (t.split('-') for t in tumors)]

In [None]:
# Create a dictionary once again, this time to convert tumor-size to numeric

In [None]:
# replace the tumor-size column with numeric counterparts
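
The tumor-size conversion can be sketched end to end on toy data. The rows are hypothetical, and `Series.map` with a dictionary is used here as an alternative to `.apply()` with a lambda.

```python
import pandas as pd

# Toy stand-in for the 'tumor-size' column (hypothetical rows)
toy = pd.DataFrame({"tumor-size": ["30-34", "20-24", "0-4", "20-24"]})

# Distinct range strings, then the midpoint of each range
tumors = sorted(toy["tumor-size"].unique())
tsize = [(int(parts[0]) + int(parts[1])) / 2 for parts in (t.split("-") for t in tumors)]

# Link each range string to its midpoint
tumor_map = dict(zip(tumors, tsize))
print(tumor_map)

# Replace each range string with its numeric midpoint
toy["tumor-size"] = toy["tumor-size"].map(tumor_map)
print(toy)
```

Any value missing from the dictionary becomes `NaN` under `map`, which makes unexpected categories easy to spot.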

### Visualizing the data : Using Matplotlib

One way to do EDA visually is to make some basic plots of the data to extract basic information from it. Some of these plots include histograms and correlation plots. 

In [None]:
import matplotlib.pyplot as plt

In [None]:
# plot histogram of ages

In [None]:
# Change the number of bins

In [None]:
# Plot the histogram of another variable of your choice
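
A minimal histogram sketch with `plt.hist`. The `ages` values are made up for illustration; with the real data you would pass a numeric column such as `df['age']` instead.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical numeric ages standing in for the cleaned age column
ages = np.array([35, 45, 45, 55, 65, 35, 45, 55, 25, 75])

# bins controls how many intervals the value range is divided into
counts, bin_edges, _ = plt.hist(ages, bins=6)

plt.xlabel("age")
plt.ylabel("count")
plt.title("Histogram of ages")
```

Changing `bins` trades detail for smoothness: more bins show finer structure but leave fewer points in each bar.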

In [None]:
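# df1 is assumed to be the dataframe that includes the concatenated dummy columns from the get_dummies step
dfs = df1[['class','age','tumor-size','deg-malig','premeno']]

# After changing them to numbers, we can see the correlation matrix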
corr = dfs.corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
# Or using plt
import matplotlib.pyplot as plt
f = plt.figure(figsize=(19, 15))
plt.matshow(dfs.corr(), fignum=f.number)
plt.xticks(range(dfs.shape[1]), dfs.columns, fontsize=14, rotation=45)
plt.yticks(range(dfs.shape[1]), dfs.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);