# Why Python for data analysis and machine learning?
There are many reasons to use Python for data science. Although Python arrived in the data science ecosystem later than R and SAS, it is now used just as frequently for analysis. A good foundation in Python and R (and SAS or SPSS) should be a *must* for **every data scientist** and machine learning enthusiast.

In this course, Python gives us an open-source way to perform machine learning that runs on just about any machine. So let's start by looking at the NumPy and pandas packages for analyzing data.

With that in mind, let's go over the following:
- NumPy matrices
- Simple operations on arrays and matrices
- Indexing with NumPy
- pandas for tabular data
- Representing categorical data (discussion point)

In [None]:
import sys
import numpy as np

print(sys.version)
print(np.__version__)

In [None]:
x = np.random.rand(5,3)
x

In [None]:
x.shape

In [None]:
x.dtype

In [None]:
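# will this work? (elementwise * needs broadcast-compatible shapes)
y = np.random.rand(3,4)  # defined here because the next cells use y
# z = x*y  # raises ValueError: shapes (5,3) and (3,4) cannot be broadcast together
# z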

In [None]:
# we can request matrix multiplication explicitly with np.dot
z = np.dot(x,y)
z

In [None]:
# or we can use the overloaded matrix multiplication operator
z = x @ y
z
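
The difference between elementwise and matrix multiplication can be sketched as follows. This is a minimal, self-contained example; the names `a`, `b`, and the seeded generator are assumptions introduced here for repeatability, not part of the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator (an assumption, for repeatability)
a = rng.random((5, 3))
b = rng.random((3, 4))

# Elementwise multiplication needs broadcast-compatible shapes,
# so multiplying a (5, 3) array by a (3, 4) array raises a ValueError.
try:
    a * b
    elementwise_ok = True
except ValueError:
    elementwise_ok = False

# Matrix multiplication only needs the inner dimensions to agree:
# (5, 3) @ (3, 4) gives a (5, 4) result.
product = a @ b
same_as_dot = np.allclose(product, np.dot(a, b))

print(elementwise_ok, product.shape, same_as_dot)
```

For 2-D arrays, `@` and `np.dot` compute the same product, so the choice between them is mostly a matter of readability.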

# Indexing

In [None]:
x1 = np.array([[1,2,3],
               [4,5,6],
               [7,8,9]])
x1

In [None]:
for row in range(x1.shape[0]):
    print(x1[row,1])

In [None]:
print(x1[:,1])
print(x1[:,1]>3)
# boolean mask indexing: keep the rows where column 1 is greater than 3
print(x1[ x1[:,1]>3 ])
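
Masks can also be combined. The sketch below assumes a small array `m` with the same values as `x1` and combines two column conditions with `&`; the parentheses are required because `&` binds more tightly than the comparisons.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Rows where column 1 is greater than 3 AND column 2 is less than 9.
mask = (m[:, 1] > 3) & (m[:, 2] < 9)
selected = m[mask]

print(mask)      # [False  True False]
print(selected)  # [[4 5 6]]
```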

In [None]:
x2 = np.array(range(10))
print(x2)
x2.shape

In [None]:
idx = x2>5
print(idx)
print(x2[idx])

In [None]:
x2[x2>5] # elements of x2 that are greater than 5
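
One detail worth knowing: reading through a boolean mask returns a copy, while assigning through a mask modifies the array in place. A minimal sketch, using hypothetical names `v` and `picked`:

```python
import numpy as np

v = np.arange(10)

picked = v[v > 5]  # boolean indexing returns a new array (a copy)
picked[:] = -1     # changing the copy leaves v untouched

v_before = v.copy()
v[v > 5] = 0       # assignment through a mask writes into v itself

print(v_before)  # [0 1 2 3 4 5 6 7 8 9]
print(v)         # [0 1 2 3 4 5 0 0 0 0]
```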

# Named columns
So what if we have a matrix of data where each row is an observation and each column holds the values of one feature?

In [None]:
col_names = ['temperature','time','day']
data = np.array([[64,2100,1],
                 [50,2200,4],
                 [48,2300,3],
                 [34,0,   2],
                 [30,100, 5]])
data

In [None]:
data2 = data[data[:,1]>1500]
data2
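
With plain NumPy we have to remember that column 1 means `time`. One workaround, sketched here with a hypothetical `col_index` dictionary, is to keep a name-to-position lookup next to the array. It works, but the bookkeeping is manual.

```python
import numpy as np

col_names = ['temperature', 'time', 'day']
data = np.array([[64, 2100, 1],
                 [50, 2200, 4],
                 [48, 2300, 3],
                 [34,    0, 2],
                 [30,  100, 5]])

# Hypothetical helper: map each column name to its position in the array.
col_index = {name: i for i, name in enumerate(col_names)}

# Select rows by column name instead of a hard-coded column number.
late = data[data[:, col_index['time']] > 1500]
late_temps = late[:, col_index['temperature']]

print(late_temps)  # [64 50 48]
```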

In [None]:
# pandas to the rescue
import pandas as pd

df = pd.DataFrame(data,columns=col_names)
df

In [None]:
# can always access the backing numpy array with .to_numpy() (or the older .values)
print(type(df.to_numpy()))
df.to_numpy()

In [None]:
df[df.time>1500]

In [None]:
# let's get a description of the data
df.info()

In [None]:
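# assign through .loc; chained indexing (df.day[mask] = ...) may write to a temporary copy
df['day'] = df['day'].astype(object)  # let the column hold strings as well as ints
df.loc[df.day == 1, 'day'] = 'Mon'
df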

In [None]:
# there is almost always a more efficient built-in pandas function
# (assign the result back rather than using inplace=True on a column selection)
df['day'] = df['day'].replace(to_replace=list(range(7)),
                              value=['Su','Mon','Tues','Wed','Th','Fri','Sat'])
df
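
An alternative to `replace` is `Series.map` with a dictionary. The sketch below uses a standalone Series `days` with the same day codes as the table above; `day_lookup` is a name introduced here for illustration.

```python
import pandas as pd

day_names = ['Su', 'Mon', 'Tues', 'Wed', 'Th', 'Fri', 'Sat']
day_lookup = dict(enumerate(day_names))  # {0: 'Su', 1: 'Mon', ...}

days = pd.Series([1, 4, 3, 2, 5])
named = days.map(day_lookup)  # look up each integer code in the dictionary

print(named.tolist())  # ['Mon', 'Th', 'Wed', 'Tues', 'Fri']
```

Unlike `replace`, `map` turns any value missing from the dictionary into `NaN`, which can help surface unexpected codes.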

In [None]:
# notice how the dtype of the column has changed to object (strings), which here represents categorical data
df.info()

In [None]:
# one hot encoding example
pd.get_dummies(df.day)
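
As a discussion point, `get_dummies` only creates columns for the values it actually sees. If the encoding should always have all seven days, one option (a sketch, not the only approach) is to declare the categories up front with `pd.Categorical`. The Series `days` below is a standalone stand-in for `df.day`.

```python
import pandas as pd

day_names = ['Su', 'Mon', 'Tues', 'Wed', 'Th', 'Fri', 'Sat']
days = pd.Series(['Mon', 'Th', 'Wed', 'Tues', 'Fri'])

# Plain strings: one column per value that is present in the data.
observed_only = pd.get_dummies(days)

# Declared categories: one column per category, including unseen days.
days_cat = pd.Series(pd.Categorical(days, categories=day_names))
one_hot = pd.get_dummies(days_cat)

print(observed_only.shape, one_hot.shape)  # (5, 5) (5, 7)
```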

# Some Pandas Syntax

In [None]:
# selecting columns from a pandas DataFrame
print(df.day)
print(df['day'])
df[['day','temperature']]

In [None]:
print(df.day[2])
print(df.day[2:])

In [None]:
# index location
df.iloc[3:]

In [None]:
df.iloc[3:][['day','temperature']]

In [None]:
df[['day','temperature']].iloc[3:]
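
With the default index, positions and labels coincide, so `iloc` and label-based access look the same. The sketch below uses a hypothetical frame with index labels 10 through 50 to show where they differ.

```python
import pandas as pd

# Hypothetical index labels, chosen so that labels differ from positions.
frame = pd.DataFrame({'temperature': [64, 50, 48, 34, 30]},
                     index=[10, 20, 30, 40, 50])

by_position = frame.iloc[3:]  # rows at positions 3 and 4
by_label = frame.loc[30:]     # rows from label 30 onward

print(list(by_position.index))  # [40, 50]
print(list(by_label.index))     # [30, 40, 50]
```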

In [None]:
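df.mean(numeric_only=True)  # skip the string-valued 'day' column

In [None]:
df.std(numeric_only=True)

In [None]:
df.mean(numeric_only=True)/df.std(numeric_only=True)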
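
A related (but different) use of the same two statistics is standardization, where each numeric column is shifted to mean 0 and scaled to standard deviation 1. The following is a minimal sketch on a standalone copy of the table; `frame`, `numeric`, and `standardized` are names introduced here.

```python
import pandas as pd

frame = pd.DataFrame({
    'temperature': [64, 50, 48, 34, 30],
    'time': [2100, 2200, 2300, 0, 100],
    'day': ['Mon', 'Th', 'Wed', 'Tues', 'Fri'],
})

# Keep only numeric columns so the arithmetic is well defined.
numeric = frame.select_dtypes(include='number')

# Column-wise z-score: subtract the column mean, divide by the column std.
standardized = (numeric - numeric.mean()) / numeric.std()

print(standardized.round(2))
```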

In [None]:
df.time.unique()

# Pandas Block Manager
Let's take a look at some important points from the following post:
 - https://uwekorn.com/2020/05/24/the-one-pandas-internal.html

This is the pandas BlockManager, which groups columns of the same dtype into shared internal blocks to make things fast:
<img src="https://uwekorn.com/images/pd-df-perception.002.png" width=200 height=200 />
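
The grouping idea can be approximated with public API only, by collecting column names per dtype. This is a rough sketch of the concept and not a view of the actual internal blocks, which depend on the pandas version and on how the frame was built; `columns_by_dtype` is a name introduced here.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'a': np.array([1, 2], dtype=np.int64),
    'b': np.array([3, 4], dtype=np.int64),
    'c': np.array([0.5, 1.5], dtype=np.float64),
    'd': ['x', 'y'],
})

# Group column names by dtype, mimicking how same-typed columns
# can share one internal block.
columns_by_dtype = {}
for name, dtype in frame.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(name)

print(columns_by_dtype)
```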

In [None]:
df

In [None]:
# _mgr is a private attribute (called _data in older pandas) and may change between versions
print(df._mgr.nblocks)
df._mgr

## Advantages and disadvantages:
This can speed up operations because pandas can apply column-wise operations (like sums, etc.) in a single pass over a block of data, with compiled code (NumPy's C routines and Cython) doing much of the heavy lifting.

But **it might be bad** when you are adding columns to the data, because this can trigger consolidation of columns, which means copying the data in NumPy to create a new matrix. The slowdown also doesn't show up until a needed column is accessed (lazy data copying). Let's do an example from: https://uwekorn.com/2020/05/24/the-one-pandas-internal.html

**Block consolidation is triggered after 100 blocks of data are reached.**
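
One way to sidestep repeated column insertion is to build the new columns together and join them in a single step with `pd.concat(axis=1)`. The sketch below uses fewer rows than the example that follows to stay quick, and makes no claim about the resulting block count, which is version dependent; `base`, `new_cols`, and `wide` are names introduced here.

```python
import numpy as np
import pandas as pd

n_rows = 1024  # smaller than the timing example, just to keep this quick
base = pd.DataFrame({
    'int64': np.arange(n_rows, dtype=np.int64),
    'float64': np.arange(n_rows, dtype=np.float64),
})

# Build all 97 new columns first, then attach them in one operation
# instead of inserting them into the frame one at a time.
new_cols = pd.DataFrame(
    {f'new_{i}': base['int64'].to_numpy() for i in range(97)},
    index=base.index,
)
wide = pd.concat([base, new_cols], axis=1)

print(wide.shape)  # (1024, 99)
```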

In [None]:
# we will start with a 2-column dataframe
# one column is an int and the other a float
# because there are two datatypes, this has two blocks
df_example = pd.DataFrame({
    'int64': np.arange(1024 * 1024, dtype=np.int64),
    'float64': np.arange(1024 * 1024, dtype=np.float64),
})
df_example

In [None]:
%%time 

# but now let's start to add columns one by one
# to be fast, pandas adds each as a new block
# so we will have 99 blocks (2+97 new ones)
for i in range(97):
    df_example[f'new_{i}'] = df_example['int64'].to_numpy()
    
print(df_example._mgr.nblocks)  # _mgr is private (formerly _data)
df_example

In [None]:
%time df_example['dummy_name3'] = df_example['int64'].values # copy over a new column
print('Number of blocks in data:',df_example._mgr.nblocks)  # _mgr is private (formerly _data)

%time df_example['dummy_name4'] = df_example['int64'].values # copy over another new column
print('Number of blocks in data:',df_example._mgr.nblocks)


In [None]:
df_example.info()