## A DataFrame

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Data is aligned in a tabular fashion, and a DataFrame consists of three principal components: the data, the rows, and the columns.



### How to create a df from a dictionary


In [2]:
import pandas as pd

my_dict = {'Computer': 1500, 'Monitor': 300, 'Printer': 150, 'Desk': 250}

# Each (key, value) pair becomes one row; name the two resulting columns
df = pd.DataFrame(list(my_dict.items()), columns=['Products', 'Prices'])

print(df)

   Products  Prices
0  Computer    1500
1   Monitor     300
2   Printer     150
3      Desk     250
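
The cell above turns each key/value pair into a row. As a minimal sketch of two other common layouts, a dictionary can also map column names to lists, or each row can be its own dictionary. The variable names products, rows, df_from_columns and df_from_rows are illustrative and do not appear in the original notes.

```python
import pandas as pd

# Column-oriented dictionary: keys are column names, values are equal-length lists
products = {
    "Products": ["Computer", "Monitor", "Printer", "Desk"],
    "Prices": [1500, 300, 150, 250],
}
df_from_columns = pd.DataFrame(products)

# Row-oriented alternative: a list of dictionaries, one per row
rows = [
    {"Products": "Computer", "Prices": 1500},
    {"Products": "Monitor", "Prices": 300},
]
df_from_rows = pd.DataFrame(rows)

print(df_from_columns)
print(df_from_rows)
```

The column-oriented form is usually the shortest to write when the data already exists as one list per column.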


### Create a subset of data

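Sometimes you only need part of a DataFrame. To keep certain columns, pass a list of column names (here titanic stands for any DataFrame that has Age and Sex columns):

    age_sex = titanic[["Age", "Sex"]]

To keep only the rows that match a condition, filter with a Boolean condition: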

In [3]:
comps = df.loc[df["Products"] == "Computer"]
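
The following sketch puts column and row subsetting side by side on the products DataFrame from earlier. The price threshold of 300 and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Products": ["Computer", "Monitor", "Printer", "Desk"],
    "Prices": [1500, 300, 150, 250],
})

# Column subset: a list of names returns a DataFrame
prices_only = df[["Prices"]]

# Row subset: keep the rows where the condition is True
cheap = df[df["Prices"] < 300]

# Both at once with .loc[row_condition, column_list]
cheap_names = df.loc[df["Prices"] < 300, ["Products"]]

print(cheap)
```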



## .isin() 

Use .isin() to select rows whose value falls into one of several categories.

You could write one condition per category and join them with the pipe operator (|), but that gets tedious when you want, for example, all states in one of three different regions. The .isin() method lets you tackle this problem by writing one condition instead of three separate ones:

colors = ["brown", "black", "tan"]
condition = dogs["color"].isin(colors)
dogs[condition]
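
To make the comparison concrete, here is a small sketch that selects the same rows both ways. The dogs DataFrame is not shown in these notes, so the data below is invented for illustration.

```python
import pandas as pd

# Hypothetical data; the real dogs DataFrame is not part of these notes
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "color": ["brown", "black", "tan", "gray", "white"],
})

# One condition per category, joined with | (each needs its own parentheses)
with_pipe = dogs[
    (dogs["color"] == "brown") | (dogs["color"] == "black") | (dogs["color"] == "tan")
]

# The same selection as a single .isin() condition
with_isin = dogs[dogs["color"].isin(["brown", "black", "tan"])]

print(with_isin)
```

Both selections return the same rows; the .isin() version stays one line long no matter how many categories are listed.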

## Pick specific columns
Sometimes you want certain columns but not others.


In [5]:
subset = df[["Products", "Prices"]]


## .head()
.head() returns the first few rows (the “head”) of the DataFrame, five by default.


In [6]:
print(df.head())

   Products  Prices
0  Computer    1500
1   Monitor     300
2   Printer     150
3      Desk     250
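
The example DataFrame has only four rows, so .head() returns all of them. As a brief sketch, passing a number limits the result, and .tail() does the same from the end. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Products": ["Computer", "Monitor", "Printer", "Desk"],
    "Prices": [1500, 300, 150, 250],
})

first_two = df.head(2)  # first two rows
last_one = df.tail(1)   # last row

print(first_two)
print(last_one)
```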


## .info()
.info() shows information on each of the columns, such as the data type and the number of non-null values (a count lower than the number of rows means there are missing values). It prints the summary itself and returns None, so it does not need to be wrapped in print().


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Products  4 non-null      object
 1   Prices    4 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes


## .shape
.shape holds the number of rows and columns of the DataFrame as a tuple. It is an attribute, not a method, so it is written without parentheses.


In [8]:
print(df.shape)

(4, 2)


## .describe()
.describe() calculates a few summary statistics for each numeric column. Here only Prices is summarized, because Products holds text.

In [9]:
print(df.describe())

            Prices
count     4.000000
mean    550.000000
std     636.396103
min     150.000000
25%     225.000000
50%     275.000000
75%     600.000000
max    1500.000000
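
By default .describe() summarizes only numeric columns, which is why Products is absent from the output above. The sketch below shows the include="all" option, which adds the non-numeric columns as well. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Products": ["Computer", "Monitor", "Printer", "Desk"],
    "Prices": [1500, 300, 150, 250],
})

# Default: numeric columns only
numeric_summary = df.describe()

# include="all" adds non-numeric columns; statistics that do not apply show as NaN
full_summary = df.describe(include="all")

print(full_summary)
```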


## Parts of a DataFrame

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes: .values, .columns, and .index.



### .values:
A two-dimensional NumPy array of values.


In [10]:
df.values

array([['Computer', 1500],
       ['Monitor', 300],
       ['Printer', 150],
       ['Desk', 250]], dtype=object)

### .columns: 

An index of columns: the column names.
You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options. (These will be covered later in the course.)

In [11]:
df.columns

Index(['Products', 'Prices'], dtype='object')
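
The third component, .index, holds the row labels and is not shown in the cells above. A minimal sketch printing all three attributes for the same products DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Products": ["Computer", "Monitor", "Printer", "Desk"],
    "Prices": [1500, 300, 150, 250],
})

print(df.index)    # row labels; a RangeIndex from 0 to 3 when none are given
print(df.columns)  # column labels
print(df.values)   # the data as a two-dimensional NumPy array
```

The pandas documentation recommends .to_numpy() over .values in new code; both return the underlying data as a NumPy array here.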

## Sorting rows

You can sort the rows by passing a column name to .sort_values(). Rows are sorted in ascending order by default; pass ascending=False to reverse the order.

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. To do so, pass a list of column names.

| Sort on … | Syntax |
| --- | --- |
| one column | df.sort_values("breed") |
| multiple columns | df.sort_values(["breed", "weight_kg"]) |

By combining .sort_values() with .head(), you can answer questions of the form "What are the top cases where…?"
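
As a sketch of a "top cases" question, the code below finds the two most expensive products and then sorts on two columns to break ties. The Category column is a hypothetical addition so that there are ties to break.

```python
import pandas as pd

df = pd.DataFrame({
    "Products": ["Computer", "Monitor", "Printer", "Desk"],
    "Prices": [1500, 300, 150, 250],
    # Hypothetical column, added only to create ties
    "Category": ["Electronics", "Electronics", "Electronics", "Furniture"],
})

# Two most expensive products: sort descending, then keep the first two rows
top_two = df.sort_values("Prices", ascending=False).head(2)

# Sort by category first, then by price within each category
by_category_then_price = df.sort_values(["Category", "Prices"])

print(top_two)
print(by_category_then_price)
```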

## .agg()

The .agg() method lets you apply your own custom functions to a DataFrame, and apply functions to more than one column at once, which keeps aggregation code compact.

For example:

df['column'].agg(function)

You can also pass a list of functions to .agg() to compute several statistics in one call.
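
The sketch below shows the three uses described above: a custom function on one column, the same function on several columns, and a list of functions. The iqr function and the Stock column are assumptions made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Products": ["Computer", "Monitor", "Printer", "Desk"],
    "Prices": [1500, 300, 150, 250],
    "Stock": [5, 10, 20, 15],  # hypothetical column, added for illustration
})


def iqr(column):
    """Custom aggregation: interquartile range of a column."""
    return column.quantile(0.75) - column.quantile(0.25)


# One custom function on one column
price_iqr = df["Prices"].agg(iqr)

# The same function on several columns at once
iqr_by_column = df[["Prices", "Stock"]].agg(iqr)

# A list of functions; built-in names and custom functions can be mixed
price_summary = df["Prices"].agg([iqr, "median", "max"])

print(price_iqr)
print(iqr_by_column)
print(price_summary)
```

With a list, the result is labeled by function name, so each statistic can be looked up as price_summary["median"] and so on.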

## Cumulative statistics
Cumulative statistics can also be helpful in tracking summary statistics over time.

For example, the cumulative sum and cumulative max of a department's weekly sales show the total sales so far and the highest weekly sales so far.

    # Sort sales_1_1 by date
    sales_1_1 = sales_1_1.sort_values("date")

    # Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
    sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

    # Get the cumulative max of weekly_sales, add as cum_max_sales col
    sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

    # See the columns you calculated
    print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])
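
The sales_1_1 data is not included in these notes. The following self-contained sketch uses four invented weeks of sales to show what the steps above produce; the dates and amounts are assumptions.

```python
import pandas as pd

# Hypothetical weekly sales for one department, deliberately out of date order
sales_1_1 = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-19", "2010-02-05", "2010-02-12", "2010-02-26"]),
    "weekly_sales": [41000.0, 25000.0, 46000.0, 19000.0],
})

# Cumulative statistics depend on row order, so sort by date first
sales_1_1 = sales_1_1.sort_values("date")

# Running total and running maximum of weekly sales
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

print(sales_1_1)
```

The running maximum stays at 46000 after the second week because no later week exceeds it, while the running total keeps growing.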

## Dropping duplicates

Removing duplicates is essential for accurate counts, because you often don't want to count the same thing multiple times.

    # Drop duplicate store/type combinations
    store_types = sales.drop_duplicates(subset=["store", "type"])
    print(store_types.head())

    # Drop duplicate store/department combinations
    store_depts = sales.drop_duplicates(subset=["store", "department"])
    print(store_depts.head())

    # Subset the rows where is_holiday is True (a Boolean column) and drop duplicate dates
    holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")

    # Print date col of holiday_dates
    print(holiday_dates["date"])
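
The sales data is not included in these notes, so here is a small invented table that shows why dropping duplicates matters before counting. All values below are assumptions.

```python
import pandas as pd

# Hypothetical sales data: one row per store, department, and week
sales = pd.DataFrame({
    "store": [1, 1, 1, 2, 2],
    "type": ["A", "A", "A", "B", "B"],
    "department": [1, 1, 2, 1, 1],
    "date": ["2010-02-05", "2010-02-12", "2010-02-12", "2010-02-05", "2010-02-12"],
    "is_holiday": [False, True, True, False, True],
})

# One row per store/type combination (the first occurrence is kept by default)
store_types = sales.drop_duplicates(subset=["store", "type"])

# Each store is now counted once per type
type_counts = store_types["type"].value_counts()
print(type_counts)

# Unique holiday dates
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")
print(holiday_dates["date"])
```

Counting type on the raw table would report three type-A rows and two type-B rows, even though there is only one store of each type.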

