# Some Python Basics

## Variables

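Variables in Python are dynamically typed: a name has no declared type, and the type belongs to whatever value is currently assigned to it. A related idea is duck typing, where code cares about what an object can do rather than what type it is (if it walks like a duck and quacks like a duck, it's a duck).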


In [10]:
var1 = 5
var2 = 100
var3 = True
var4 = None
var5 = "I can be anything"
print(var1, var2, var3, var4, var5)

5 100 True None I can be anything
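
As a small illustration of dynamic typing, the sketch below rebinds one name to values of two different types and then calls a function that accepts any object supporting `len()`. The names `value` and `describe` are made up for this example.

```python
# A name can be rebound to values of different types; the type belongs
# to the value, not to the name.
value = 5
first_type = type(value).__name__

value = "five"
second_type = type(value).__name__

print(first_type, second_type)


def describe(thing):
    """Duck typing: accepts any object that supports len()."""
    return len(thing)


# A string and a list are unrelated types, but both have a length.
print(describe("abc"), describe([1, 2]))
```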


To get more information on a variable's type, you can use the type() function.

In [11]:
var6 = 100
var7 = "darkness"
print(type(var6), type(var7))

<class 'int'> <class 'str'>


We can also get input from users to fill our variables.

In [12]:
var8 = input("whats your name? ")
print("Hello, " + var8)

whats your name?  Chelsea


Hello, Chelsea
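
Note that `input()` always returns a string, so numeric input has to be converted before it can be used in arithmetic. The helper below, `parse_age`, is a hypothetical name used only for this sketch; it takes text directly so that it can be run without typing anything.

```python
def parse_age(raw_text):
    """Convert text (such as the return value of input()) to an int.

    Returns None when the text is not a whole number.
    """
    try:
        return int(raw_text.strip())
    except ValueError:
        return None


# In a notebook this might be called as: parse_age(input("How old are you? "))
print(parse_age(" 42 "))
print(parse_age("forty"))
```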


## Booleans

Booleans can be handy when working with DataFrames, as we will see later. You can also add them: `bool` is a subclass of `int`, so False is interpreted as 0 and True as 1.

In [13]:
var9 = True
var10 = False
var11 = var9 + var10
print(var9, var10, var11)

True False 1
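
Because True counts as 1, summing a list of Booleans counts how many conditions were met. A minimal sketch with made-up temperature values:

```python
temperatures = [61, 75, 82, 90, 68]

# One Boolean per reading: is it above 80 degrees?
is_hot = [t > 80 for t in temperatures]

hot_days = sum(is_hot)                  # number of True values
hot_share = hot_days / len(is_hot)      # fraction of True values

print(is_hot, hot_days, hot_share)
```

The same pattern works on a pandas column of Booleans, which is one reason they are convenient in DataFrames.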


## Strings
Strings in Python are created with ' or " and are immutable: if changes need to be made to a string, a new one is returned. Python 3 strings are sequences of Unicode characters (UTF-8 is the default encoding when they are converted to bytes), so they work with text in different languages out of the box. Python strings work similarly to C++ STL strings in that they are classes with built-in support functions; however, in Python the amount of functionality is much larger.

In [14]:
var12 = 'Hello ' + "world, " + "Python"
var13 = var12.lower()

print(var13)

hello world, python
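
The sketch below illustrates immutability and the difference between characters and encoded bytes. The variable names are chosen for this example only.

```python
greeting = "Hello"

# String methods return a new string; the original is left unchanged.
shouted = greeting.upper()

# Assigning to a position in a string is not allowed.
try:
    greeting[0] = "J"
except TypeError as err:
    error_message = str(err)

# To "change" a string, build a new one from pieces of the old one.
renamed = "J" + greeting[1:]

# One accented character is one character, but two bytes in UTF-8.
word = "caf\u00e9"
char_count = len(word)
byte_count = len(word.encode("utf-8"))

print(greeting, shouted, renamed, char_count, byte_count)
```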


## Lists
In C++, choosing which container to use is very important (list, queue, stack, vector, array?). In Python, the built-in list covers most of these everyday uses: it is a resizable, ordered sequence that can also serve as a stack. To create a list, use square brackets []. Notice that the element types don't have to match; a list can hold values of any type.

In [15]:
list1 = [1, 2, 3, 4, 5.0, 6.0, True, False]

list1.append(123)

print(list1)

[1, 2, 3, 4, 5.0, 6.0, True, False, 123]
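
A short sketch of common list operations, using made-up values. For queue-style use, where items are removed from the front, the standard library's `collections.deque` is usually a better fit than a plain list, so it is shown here as well.

```python
from collections import deque

items = [10, 20, 30, 40]

first = items[0]       # indexing starts at 0
last = items[-1]       # negative indexes count from the end
middle = items[1:3]    # slicing returns a new list

# Stack behavior: push with append(), pop from the end.
items.append(50)
top = items.pop()

# Queue behavior: a deque removes items from the front efficiently.
queue = deque(items)
front = queue.popleft()

print(first, last, middle, top, front)
```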


# Pandas
In the last lesson, we got to see Pandas in action by using it to make some visualizations of the Titanic data.  Let's take some time to explore some of the cool features of Python and Pandas.

## The History of Pandas

Origins:

* 2008: The Pandas project was started by Wes McKinney when he was working at AQR Capital Management. The main motivation was to have a flexible tool to perform quantitative analysis on financial data. The name "pandas" is derived from the term "panel data," a common term for data that involves observations over time.

Early Development:

* 2009: Wes McKinney released the first public version of pandas. The initial versions laid the foundation with data structures like Series and DataFrame, which have since become staples for data manipulation in Python.

Increasing Adoption:

* 2010s: As data science and Python grew in popularity during the 2010s, so did pandas. It quickly became one of the cornerstones of the scientific stack in Python alongside libraries like NumPy, SciPy, and Matplotlib. The library received significant contributions from many developers worldwide, enhancing its capabilities and making it more robust.

Books and Documentation:

* 2012: Wes McKinney published "Python for Data Analysis," which prominently features pandas and its application in data analysis. This book played a crucial role in introducing many individuals to pandas and data analysis in Python.


Pandas is often seen as a gateway to data science in Python. Its simple yet powerful interface makes it a favorite for beginners and professionals alike.
Big data tools such as Apache Spark, Dask, and Vaex offer DataFrame interfaces that will feel familiar to pandas users, which helps when an analysis needs to scale beyond what pandas handles comfortably.

## DataFrames and Series

The DataFrame is the primary structure we will be using for this class. It is an associative (label-indexed), two-dimensional data structure: imagine a spreadsheet page, SQL table, or flat file. The Series object is a one-dimensional data structure that represents a single column of data.

We can manually create a DataFrame from dictionaries, lists, Series, and much else. We can also add new features (columns) to a DataFrame, or even combine multiple DataFrames. If our data is provided to us, we can read from or write to a variety of formats and sources: CSV, Excel, SQL, JSON, a URL, the clipboard, etc.

Selecting a single column of a DataFrame, for example `df['product_name']`, returns a Series.
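
A minimal sketch of creating a DataFrame by hand and pulling a Series out of it. The data and the `is_bulk` column are invented for illustration.

```python
import pandas as pd

# Build a DataFrame from a dictionary of column name -> list of values.
orders = pd.DataFrame({
    "product_name": ["Banana", "Limes", "Banana"],
    "quantity": [3, 6, 2],
})

# Selecting one column gives a Series.
quantity = orders["quantity"]

# Add a new feature (column) computed from an existing one.
orders["is_bulk"] = orders["quantity"] >= 5

print(type(quantity).__name__)
print(orders)
```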

## Common useful Pandas methods

### DataFrame Creation and Input/Output
- `pd.DataFrame()`: Create a DataFrame.
- `pd.read_csv()`: Read a CSV file into a DataFrame.
- `pd.read_excel()`: Read an Excel file into a DataFrame.
- `df.to_csv()`: Write a DataFrame to a CSV file.
- `df.to_excel()`: Write a DataFrame to an Excel file.

### Viewing and Inspecting Data
- `df.head()`: View the first few rows of the DataFrame.
- `df.tail()`: View the last few rows of the DataFrame.
- `df.info()`: Get a concise summary of the DataFrame.
- `df.describe()`: Generate descriptive statistics.
- `df.shape`: Get the dimensions of the DataFrame.
- `df.columns`: Get the column labels.
- `df.index`: Get the row labels.

### Selection and Filtering
- `df.loc[]`: Access a group of rows and columns by labels.
- `df.iloc[]`: Access a group of rows and columns by integer position.
- `df[df['column'] > value]`: Filter rows based on column values.
- `df.query()`: Query the DataFrame with a boolean expression.
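
The selection methods above can be compared side by side on a tiny, made-up DataFrame (the name `prices` and its values are assumptions for this sketch):

```python
import pandas as pd

prices = pd.DataFrame(
    {"product": ["Banana", "Limes", "Avocado"], "price": [0.25, 0.50, 1.50]},
    index=["a", "b", "c"],
)

by_label = prices.loc["b", "price"]      # row label "b", column "price"
by_position = prices.iloc[0, 0]          # first row, first column

# Two equivalent ways to keep rows where price exceeds 0.4.
over_threshold = prices[prices["price"] > 0.4]
same_rows = prices.query("price > 0.4")

print(by_label, by_position)
print(over_threshold)
```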

### Grouping and Aggregation
- `df.groupby()`: Group data by one or more columns.
- `df.agg()`: Aggregate using one or more operations over the specified axis.
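- `df.groupby().size()`: Count the rows in each group. (`df.size`, without parentheses, is an attribute giving the total number of elements in the DataFrame.)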
- `df.sum()`: Compute the sum of values.
- `df.mean()`: Compute the mean of values.
- `df.median()`: Compute the median of values.
- `df.min()`: Compute the minimum of values.
- `df.max()`: Compute the maximum of values.
- `df.count()`: Count the number of non-NA/null observations.
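
A brief sketch of grouping and aggregation on invented data. It also shows that `size()` is a method of the grouped object, while `size` on a DataFrame is an attribute.

```python
import pandas as pd

sales = pd.DataFrame({
    "dept": ["produce", "produce", "dairy"],
    "units": [3, 5, 2],
})

# Number of rows in each group.
rows_per_dept = sales.groupby("dept").size()

# Several aggregations of one column at once.
summary = sales.groupby("dept")["units"].agg(["sum", "mean"])

# Attribute, not a method: rows multiplied by columns.
total_cells = sales.size

print(rows_per_dept)
print(summary)
print(total_cells)
```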

### Data Cleaning and Preparation
- `df.drop()`: Drop specified labels from rows or columns.
- `df.dropna()`: Remove missing values.
- `df.fillna()`: Fill missing values.
- `df.replace()`: Replace values.
- `df.rename()`: Rename labels.
- `df.astype()`: Cast a pandas object to a specified dtype.
- `df.sort_values()`: Sort by the values along either axis.
- `df.sort_index()`: Sort by the index.
- `df.set_index()`: Set the DataFrame index using existing columns.
- `df.reset_index()`: Reset the index, or a level of it.
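
Several cleaning steps are often chained together. The sketch below uses a small invented table with a missing name and quantities stored as text.

```python
import pandas as pd

raw = pd.DataFrame({
    "Name": ["b", "a", None],
    "qty": ["2", "1", "3"],
})

clean = (
    raw.dropna(subset=["Name"])            # drop rows with a missing name
       .rename(columns={"Name": "name"})   # use a lowercase column label
       .astype({"qty": int})               # convert text to integers
       .sort_values("name")                # order rows alphabetically
       .reset_index(drop=True)             # renumber rows from 0
)

print(clean)
```

Each method returns a new DataFrame, so `raw` itself is left unchanged.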

### Merging and Joining
- `pd.merge()`: Merge DataFrame objects by performing a database-style join.
- `df.join()`: Join columns with other DataFrame.
- `pd.concat()`: Concatenate pandas objects along a particular axis.
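
A minimal sketch of a database-style join and a concatenation, using two invented tables that share a `product_id` column:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "product_id": [10, 20, 10]})
products = pd.DataFrame({
    "product_id": [10, 20],
    "product_name": ["Banana", "Limes"],
})

# Left join: keep every order and look up its product name.
merged = pd.merge(orders, products, on="product_id", how="left")

# Stack two DataFrames with the same columns on top of each other.
stacked = pd.concat([orders, orders], ignore_index=True)

print(merged)
print(len(stacked))
```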

### Date and Time
- `pd.to_datetime()`: Convert argument to datetime.
- `df['column'].dt`: Accessor object for datetime-like properties.
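
As a sketch of the `.dt` accessor, the example below parses two made-up timestamps and extracts the hour and the day name from each:

```python
import pandas as pd

stamps = pd.Series(["2024-01-15 08:30", "2024-03-02 21:05"])

# Convert text to datetime values, then use .dt to pull out parts.
parsed = pd.to_datetime(stamps)
hours = parsed.dt.hour
weekdays = parsed.dt.day_name()

print(hours.tolist(), weekdays.tolist())
```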

### String Methods
- `df['column'].str`: Accessor object for string methods.
- `df['column'].str.contains()`: Test if pattern or regex is contained within a string of a Series or Index.
- `df['column'].str.replace()`: Replace occurrences of pattern/regex/string with some other string.
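
A short sketch of the `.str` accessor on invented product names. The `na=False` argument treats missing values as non-matches, and `regex=False` asks for a plain text replacement.

```python
import pandas as pd

names = pd.Series(["Organic Banana", "Limes", None])

# Case-insensitive search; missing values count as "no match".
has_banana = names.str.contains("banana", case=False, na=False)

# Literal (non-regex) replacement; missing values stay missing.
shortened = names.str.replace("Organic ", "", regex=False)

print(has_banana.tolist())
print(shortened.tolist())
```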

### Statistical Functions
- `df.corr()`: Compute pairwise correlation of columns.
- `df.cov()`: Compute pairwise covariance of columns.
- `df.var()`: Compute variance of columns.
- `df.std()`: Compute standard deviation of columns.
- `df.mad()`: Compute mean absolute deviation of columns (deprecated in pandas 1.5 and removed in pandas 2.0).
- `df.kurt()`: Compute kurtosis of columns.
- `df.skew()`: Compute skewness of columns.
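
Since `mad()` is not available in pandas 2.0 and later, the mean absolute deviation can be computed from its definition instead. A sketch on made-up numbers:

```python
import pandas as pd

scores = pd.DataFrame({"x": [1.0, 2.0, 3.0, 6.0]})

# Mean absolute deviation: average distance of each value from the mean.
mean_abs_dev = (scores - scores.mean()).abs().mean()

# Standard deviation for comparison (sample version, the pandas default).
std_dev = scores.std()

print(mean_abs_dev["x"], std_dev["x"])
```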

### Visualization
- `df.plot()`: Make plots of DataFrame using matplotlib.

### Miscellaneous
- `df.pivot()`: Reshape the DataFrame using chosen columns as the new index, columns, and values (no aggregation).
- `df.pivot_table()`: Create a spreadsheet-style pivot table as a DataFrame.
- `df.apply()`: Apply a function along an axis of the DataFrame.
- `df.applymap()`: Apply a function to a DataFrame elementwise (deprecated since pandas 2.1 in favor of `df.map()`).
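
A small sketch of `pivot_table()` followed by `apply()`, using an invented long-format table of items ordered per user and day of week:

```python
import pandas as pd

long_format = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "dow": [0, 1, 0],
    "items": [4, 6, 5],
})

# One row per user, one column per day of week, summing items.
wide = long_format.pivot_table(
    index="user", columns="dow", values="items", aggfunc="sum"
)

# Apply a function to each row (axis=1); missing cells are skipped by sum().
row_totals = wide.apply(lambda row: row.sum(), axis=1)

print(wide)
print(row_totals)
```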


## Data wrangling

Let's explore the Instacart data to understand more about purchasing habits using Pandas methods.

### Frequency Tables

What are the top 10 most commonly ordered products?

In [16]:
import pandas as pd

df = pd.read_csv('assets/instacart_sample.csv')

df.groupby('product_name').size().sort_values(ascending=False).head(10)

product_name
Banana                    465
Bag of Organic Bananas    411
Organic Strawberries      291
Organic Baby Spinach      244
Organic Hass Avocado      222
Large Lemon               187
Organic Avocado           179
Strawberries              152
Organic Whole Milk        145
Limes                     141
dtype: int64
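
An alternative way to build a frequency table is `value_counts()`, which counts and sorts in one step. The sketch below compares the two approaches on a tiny invented sample rather than the Instacart file.

```python
import pandas as pd

sample = pd.DataFrame({
    "product_name": ["Banana", "Limes", "Banana", "Banana", "Limes", "Kiwi"],
})

# The groupby approach: count rows per product, then sort.
by_groupby = (
    sample.groupby("product_name").size().sort_values(ascending=False).head(2)
)

# value_counts() returns counts already sorted from most to least common.
by_value_counts = sample["product_name"].value_counts().head(2)

print(by_groupby)
print(by_value_counts)
```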

### Sorting

Can we sort the dataset by the name of the product in alphabetical order?

In [17]:

sorted_df = df.sort_values(by='product_name')
sorted_df.head(10)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_name,aisle_id,department_id
26986,2655954,25773,1,0,5996,prior,17,2,19,1.0,#2 Coffee Filters,26,7
21744,672887,3756,3,1,146132,prior,3,0,14,30.0,0% Fat Blueberry Greek Yogurt,120,16
5343,978516,49517,6,1,113937,prior,24,5,20,5.0,0% Fat Free Organic Milk,84,16
3286,1954933,49517,6,1,113937,prior,41,5,19,5.0,0% Fat Free Organic Milk,84,16
17306,2958521,49517,1,0,136800,train,4,4,7,9.0,0% Fat Free Organic Milk,84,16
19338,2843518,49517,3,1,33524,prior,51,2,8,4.0,0% Fat Free Organic Milk,84,16
422,3267808,22022,11,1,44240,prior,4,6,0,0.0,0% Fat Greek Yogurt Black Cherry on the Bottom,120,16
18124,1589871,37508,10,0,106438,prior,2,2,10,30.0,0% Fat Organic Greek Vanilla Yogurt,120,16
24153,1563269,37508,2,1,36287,prior,27,1,12,13.0,0% Fat Organic Greek Vanilla Yogurt,120,16
20129,340759,38928,4,1,41852,prior,5,4,19,24.0,0% Greek Strained Yogurt,120,16
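
Sorting strings compares character codes, so symbols such as `#` and digits come before letters, and uppercase letters come before lowercase ones. That is why the first rows above start with `#2` and `0%`. The sketch below, on invented names, shows the default order and a case-insensitive order using the `key` argument of `sort_values()`.

```python
import pandas as pd

sample = pd.DataFrame({"product_name": ["banana", "Cherry", "apple"]})

# Default: uppercase "C" sorts before any lowercase letter.
default_sort = sample.sort_values(by="product_name")

# Case-insensitive: compare lowercase versions of the names.
case_insensitive = sample.sort_values(
    by="product_name", key=lambda names: names.str.lower()
)

print(default_sort["product_name"].tolist())
print(case_insensitive["product_name"].tolist())
```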


### Groupby

For each user, what is the average order size?

In [22]:
order_size = df.groupby(['user_id', 'order_id'])['add_to_cart_order'].max()
mean_order = order_size.groupby('user_id').mean()
mean_order

user_id
10         5.0
14         8.0
17         1.0
19        20.0
27         5.0
          ... 
206172     1.0
206174    17.0
206178     9.0
206180     9.0
206184     6.0
Name: add_to_cart_order, Length: 28171, dtype: float64
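
To see what the two groupby steps do, here is the same calculation on a tiny invented table. It assumes, as the code above does, that the largest `add_to_cart_order` value in an order equals the number of items in that order; in a sampled dataset this is only an approximation if some rows of an order are missing.

```python
import pandas as pd

sample = pd.DataFrame({
    "user_id":           [1, 1, 1, 1, 1, 1, 2],
    "order_id":          [100, 100, 101, 101, 101, 101, 200],
    "add_to_cart_order": [1, 2, 1, 2, 3, 4, 1],
})

# Step 1: one value per (user, order), the highest cart position.
order_size = sample.groupby(["user_id", "order_id"])["add_to_cart_order"].max()

# Step 2: average those order sizes for each user.
mean_order = order_size.groupby("user_id").mean()

print(order_size)
print(mean_order)
```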

### Subsets


Create a subset of products that contain bananas (product_name = Banana, Bunch of Bananas, banana flavor, etc.)

In [23]:
bananas = df[df['product_name'].str.contains('banana', case=False)]

bananas.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_name,aisle_id,department_id
26,1959304,24852,2,1,62292,prior,8,2,18,28.0,Banana,24,4
35,2247363,24852,1,1,188096,prior,16,3,9,6.0,Banana,24,4
71,2637446,24852,4,1,48409,prior,18,1,8,15.0,Banana,24,4
77,3218842,24852,3,1,195231,prior,14,2,16,27.0,Banana,24,4
112,1955087,24852,2,1,138539,prior,5,6,8,8.0,Banana,24,4


### Feature engineering

What is the average order size for each time of day, where time of day is categorized as early morning (4am to noon), afternoon (noon to 8pm), and night (8pm to 4am)?

In [28]:

# Create a new column for time of day, defaulting to 'Night' (8pm to 4am)
df['time_of_day'] = 'Night'

# Use .loc to categorize the remaining hours of the day
df.loc[(df['order_hour_of_day'] >= 4) & (df['order_hour_of_day'] < 12), 'time_of_day'] = 'Early Morning'
df.loc[(df['order_hour_of_day'] >= 12) & (df['order_hour_of_day'] < 20), 'time_of_day'] = 'Afternoon'

# Group by order_id and time of day to get order sizes
order_sizes = df.groupby(['order_id', 'time_of_day'])['add_to_cart_order'].max()

# Calculate average order size by time of day
avg_order_sizes = order_sizes.groupby('time_of_day').mean()
avg_order_sizes

time_of_day
Afternoon        8.213755
Early Morning    8.527152
Night            8.813728
Name: add_to_cart_order, dtype: float64
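
The same time-of-day feature can also be used to find the most commonly ordered product in each bucket. The sketch below is one possible approach on a tiny invented sample, not the Instacart file: it builds the category with `numpy.select` instead of `.loc` and then takes the top entry of `value_counts()` within each group.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "order_hour_of_day": [5, 9, 10, 13, 15, 22, 2, 23],
    "product_name": ["Banana", "Banana", "Limes", "Limes", "Limes",
                     "Kiwi", "Kiwi", "Banana"],
})

hour = sample["order_hour_of_day"]

# Same cutoffs as above: 4 to 12 is early morning, 12 to 20 is afternoon,
# and everything else falls through to the default, night.
sample["time_of_day"] = np.select(
    [(hour >= 4) & (hour < 12), (hour >= 12) & (hour < 20)],
    ["Early Morning", "Afternoon"],
    default="Night",
)

# Within each bucket, pick the product name that appears most often.
top_product = sample.groupby("time_of_day")["product_name"].agg(
    lambda names: names.value_counts().idxmax()
)

print(top_product)
```

If two products are tied within a bucket, this sketch simply returns one of them, so a real analysis may want to handle ties explicitly.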