# Pandas Data Structures
Pandas is a Python package that provides fast and flexible data structures that make data analysis easy and intuitive.

## 1. Importing Pandas and Reading Data Frames


Pandas can read data from many different file formats, such as CSV, Excel, and JSON.

In [1]:
import pandas as pd

df = pd.read_csv('sample_data/california_housing_test.csv')
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.30,34.26,43.0,1510.0,310.0,809.0,277.0,3.5990,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
3,-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
4,-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0
...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23.0,1450.0,642.0,1258.0,607.0,1.1790,225000.0
2996,-118.14,34.06,27.0,5257.0,1082.0,3496.0,1036.0,3.3906,237200.0
2997,-119.70,36.30,10.0,956.0,201.0,693.0,220.0,2.2895,62000.0
2998,-117.12,34.10,40.0,96.0,14.0,46.0,14.0,3.2708,162500.0


Pandas also allows the user to read in only certain columns by passing their names to usecols.

In [None]:
newdf = pd.read_csv('sample_data/california_housing_test.csv', usecols=['longitude', 'latitude', 'households'])
newdf
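As a self-contained sketch of the same idea, the cell below reads from an in-memory string instead of a file, so it runs without the sample data on disk. The csv_text variable is a hypothetical stand-in holding the first three rows shown above; only pd.read_csv and its usecols parameter come from the tutorial.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a file on disk.
csv_text = """longitude,latitude,households
-122.05,37.37,606.0
-118.30,34.26,277.0
-117.81,33.78,495.0
"""

# read_csv accepts a file path or any file-like object.
# Only the columns listed in usecols are loaded.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["latitude", "households"])
print(subset)
```

Skipping unneeded columns at read time can save memory on large files, since they are never loaded.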

## 2. Viewing Data

Users can view their data in many different ways using pandas.

### .head() and .tail()

.head() will output the top rows of your data frame (5 by default).

In [None]:
df.head() # top 5 rows by default

.tail() will output the bottom rows of your data frame (5 by default).

In [None]:
df.tail() # bottom 5 rows by default

The user can pass in a number to specify how many rows they want to output.

In [None]:
df.head(3) # top 3 rows

In [None]:
df.tail(7) # bottom 7 rows
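Both methods also accept a negative number, which means "everything except that many rows". The sketch below uses a small toy frame (the name small and its values are illustrative, taken from the first rows shown earlier) to show this.

```python
import pandas as pd

# Toy frame with five rows, used only for illustration.
small = pd.DataFrame({"population": [1537.0, 809.0, 1484.0, 49.0, 850.0]})

all_but_last_two = small.head(-2)   # drops the last 2 rows
all_but_first_two = small.tail(-2)  # drops the first 2 rows

print(all_but_last_two)
print(all_but_first_two)
```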

### .index and .columns

.index will output information about the row labels. For the default integer index, this is the start index, the ending index (not inclusive), and the step between each index.

In [None]:
df.index

.columns will output the names of all the columns in your data frame.

In [None]:
df.columns
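The pieces of the default index can also be read individually. This is a minimal sketch on a toy frame (small is a hypothetical name), assuming the frame was built without a custom index so that pandas assigns a RangeIndex.

```python
import pandas as pd

# Toy frame; with no index given, pandas assigns a RangeIndex.
small = pd.DataFrame({
    "population": [1537.0, 809.0, 1484.0],
    "households": [606.0, 277.0, 495.0],
})

idx = small.index
print(idx.start, idx.stop, idx.step)  # start, end (not inclusive), step
print(list(small.columns))            # column names as a plain list
```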

### .T

.T will output the transpose of your data frame.

In [None]:
df.T # the rows will swap with the columns

### .describe()

.describe() will output summary statistics for each numeric column: the count, mean, standard deviation, minimum, quartiles, and maximum.

In [2]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,-119.5892,35.63539,28.845333,2599.578667,529.950667,1402.798667,489.912,3.807272,205846.275
std,1.994936,2.12967,12.555396,2155.593332,415.654368,1030.543012,365.42271,1.854512,113119.68747
min,-124.18,32.56,1.0,6.0,2.0,5.0,2.0,0.4999,22500.0
25%,-121.81,33.93,18.0,1401.0,291.0,780.0,273.0,2.544,121200.0
50%,-118.485,34.27,29.0,2106.0,437.0,1155.0,409.5,3.48715,177650.0
75%,-118.02,37.69,37.0,3129.0,636.0,1742.75,597.25,4.656475,263975.0
max,-114.49,41.92,52.0,30450.0,5419.0,11935.0,4930.0,15.0001,500001.0
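Because .describe() returns a data frame, individual statistics can be pulled out of it by label. The sketch below does this on a toy frame; the names small, summary, mean_pop, and median_pop are illustrative.

```python
import pandas as pd

# Toy frame holding the population values of the first five rows.
small = pd.DataFrame({"population": [1537.0, 809.0, 1484.0, 49.0, 850.0]})

summary = small.describe()

# Rows of the summary are labeled by statistic, columns by column name.
mean_pop = summary.loc["mean", "population"]
median_pop = summary.loc["50%", "population"]  # the 50% row is the median

print(mean_pop, median_pop)
```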


### Sorting Data by Axis and by Values

.sort_index() sorts the data frame along the given axis: axis=0 sorts by the row index and axis=1 sorts by the column names.

In [None]:
df.sort_index(axis=0, ascending=False) # the index numbers will be sorted in descending order

In [None]:
df.sort_index(axis=1, ascending=True) # the column names will be sorted in alphabetical order

.sort_values() sorts the rows of the data frame by the values of the given column.

In [None]:
df.sort_values(by='total_rooms') # rows will be sorted based on the values in total_rooms
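.sort_values() also takes ascending=False for descending order, and a list of columns to break ties. The following is a small sketch on a toy frame with illustrative values; note that sorting returns a new data frame and leaves the original unchanged.

```python
import pandas as pd

# Toy frame with a tie in housing_median_age (rows 0 and 2).
small = pd.DataFrame({
    "housing_median_age": [27.0, 43.0, 27.0, 28.0],
    "total_rooms": [3885.0, 1510.0, 3589.0, 67.0],
})

# Largest total_rooms first.
by_rooms_desc = small.sort_values(by="total_rooms", ascending=False)

# Sort by age, then by total_rooms within equal ages.
by_age_then_rooms = small.sort_values(by=["housing_median_age", "total_rooms"])

print(by_rooms_desc)
print(by_age_then_rooms)
```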

## 3. Selecting Data

Pandas also provides a variety of ways for the user to select specific data.

Users can pick out individual rows and columns or use : to specify a range of values. Note that the right value of a range is not inclusive (except with .loc, which slices by label and includes the end label), and that using : by itself will give all values.

Some of these selections will output a series (a one-dimensional labeled array). To output a data frame instead, use double brackets [[]] rather than single brackets [].

### Bracket Notation

Users can select data using bracket notation:
df[column][row]

Providing a single column name will output the matching column. Providing an index range instead will output the rows in that range.

In [None]:
df['population'] # outputs the population column as a series

In [None]:
df[['population']] # outputs the population column as a data frame

In [None]:
df[5:10] # outputs rows 5-10 (not including 10)

In [None]:
df['population'][1] # outputs row 1 of the population column

In [None]:
df['population'][:4] # outputs rows 0-4 (not including 4) of the population column as a series

In [None]:
df[['population']][:4] # outputs rows 0-4 (not including 4) of the population column as a data frame

Users can select multiple columns by providing the list of column names in a set of brackets.

In [None]:
df[['population', 'households']][:4] # outputs rows 0-4 (not including 4) of the population and households columns
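To make the difference between single and double brackets concrete, the sketch below checks the type and shape of each result on a toy frame (small, as_series, and as_frame are illustrative names).

```python
import pandas as pd

# Toy frame with two columns and four rows.
small = pd.DataFrame({
    "population": [1537.0, 809.0, 1484.0, 49.0],
    "households": [606.0, 277.0, 495.0, 11.0],
})

as_series = small["population"]    # single brackets: a Series
as_frame = small[["population"]]   # double brackets: a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

A series has one dimension, so its shape has a single entry; the one-column data frame keeps both a row and a column dimension.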

### Column Attribute

Data can also be selected using the column name as an attribute, followed by the row value(s) in brackets.

In [None]:
df.population[1] # outputs row 1 of the population column

In [None]:
df.population[:4] # outputs rows 0-4 (not including 4) of the population column as a series
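Attribute access only works when the column name is a valid Python identifier and does not clash with an existing data frame method. The sketch below illustrates both limits; the columns "median income" (with a space) and "count" are hypothetical names chosen for the example and are not in the housing data.

```python
import pandas as pd

# Toy frame; "median income" and "count" are hypothetical column names.
small = pd.DataFrame({
    "population": [1537.0, 809.0],
    "median income": [6.6085, 3.5990],
    "count": [1, 2],
})

via_attribute = small.population       # valid identifier: attribute access works
via_brackets = small["median income"]  # space in the name: brackets are required

# small.count is the DataFrame.count method, not the "count" column,
# so the column must be reached with brackets.
count_is_method = callable(small.count)
count_column = small["count"]

print(count_is_method)
print(count_column.tolist())
```

Bracket notation works for every column name, which makes it the safer choice when names are not known ahead of time.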

### .at

.at is used for selecting a single data value at a given row and column.

In [None]:
df.at[1, 'population'] # outputs the value in row 1 of the population column

### .loc and .iloc

.loc and .iloc select data in row, column order, unlike the methods above, which use column, row order. Providing only one parameter will output the rows it specifies.

.loc selects data through labels that the user gives it.

In [None]:
df.loc[1] # outputs row 1 as a series

In [None]:
df.loc[[1]] # outputs row 1 as a data frame

In [None]:
df.loc[1, 'population'] # outputs row 1 of the population column

In [None]:
df.loc[:4, 'population'] # outputs rows 0-4 (including 4) of the population column as a series

In [None]:
df.loc[:4, ['population']] # outputs rows 0-4 (including 4) of the population column as a data frame

In [None]:
df.loc[:4, ['population', 'households']] # outputs rows 0-4 (including 4) of the population and households columns

.iloc selects data using index positions rather than labels.

In [None]:
df.iloc[1] # outputs row 1 as a series

In [None]:
df.iloc[[1]] # outputs row 1 as a data frame

In [None]:
df.iloc[1,5] # outputs row 1 in column 5

In [None]:
df.iloc[:4, 5] # outputs the first 4 rows in column 5 as a series

In [None]:
df.iloc[:4, [5]] # outputs the first 4 rows in column 5 as a data frame

In [None]:
df.iloc[:4, :6] # outputs the first 4 rows in the first 6 columns
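The same slice behaves differently in .loc and .iloc: .loc includes the end label, while .iloc stops before the end position. The sketch below shows this on a toy frame, and also shows how labels and positions diverge once the rows are reordered. All variable names are illustrative.

```python
import pandas as pd

# Toy frame with six rows and the default integer labels 0..5.
small = pd.DataFrame({"population": [1537.0, 809.0, 1484.0, 49.0, 850.0, 1258.0]})

by_label = small.loc[:4, "population"]   # labels 0 through 4, end included
by_position = small.iloc[:4, 0]          # positions 0 through 3, end excluded

print(len(by_label), len(by_position))

# After sorting, the labels travel with the rows but the positions do not.
sorted_small = small.sort_values(by="population")
label_zero = sorted_small.loc[0, "population"]    # the row labeled 0
position_zero = sorted_small.iloc[0, 0]           # the first row after sorting

print(label_zero, position_zero)
```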

### Data Filters

Users can filter data using boolean conditions.

In [None]:
df[df.population < 30] # outputs the rows where population is less than 30

In [None]:
df[df['population'] == 27] # outputs the rows where population equals 27

In [3]:
df[df['population'] > 8000] # outputs the rows where population is greater than 8000

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
33,-118.08,34.55,5.0,16181.0,2971.0,8152.0,2651.0,4.5237,141800.0
321,-121.73,37.68,17.0,20354.0,3493.0,8768.0,3293.0,5.4496,238900.0
978,-121.53,38.48,5.0,27870.0,5027.0,11935.0,4855.0,4.8811,212200.0
1146,-117.27,33.15,4.0,23915.0,4135.0,10877.0,3958.0,4.6357,244900.0
1597,-117.12,33.49,4.0,21988.0,4055.0,8824.0,3252.0,3.9963,191100.0
2186,-116.14,34.45,12.0,8796.0,1721.0,11139.0,1680.0,2.2612,137500.0
2429,-117.2,33.58,2.0,30450.0,5033.0,9419.0,3197.0,4.5936,174300.0


Users can also use .isin to filter for specific values.

In [None]:
df[df.housing_median_age.isin([27.0, 43.0])] # outputs the rows where housing median age is 27 or 43
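Conditions can be combined with & (and), | (or), and ~ (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. The sketch below uses a toy frame with illustrative values.

```python
import pandas as pd

# Toy frame with two columns from the housing data.
small = pd.DataFrame({
    "housing_median_age": [27.0, 43.0, 27.0, 28.0, 19.0],
    "population": [1537.0, 809.0, 1484.0, 49.0, 850.0],
})

# Rows where both conditions hold.
both = small[(small["population"] > 800) & (small["housing_median_age"] < 30)]

# Rows where at least one condition holds.
either = small[(small["population"] < 100) | (small["housing_median_age"] > 40)]

# ~ negates a condition: rows whose age is NOT 27 or 43.
not_in = small[~small["housing_median_age"].isin([27.0, 43.0])]

print(both)
print(either)
print(not_in)
```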

### Random Samples

Pandas also provides a way for the user to generate random samples from their data frame by using .sample().

In [None]:
df.sample() # outputs a single random row of data

In [None]:
df.sample(n=5) # outputs 5 random rows of data

In [None]:
df.sample(frac=0.1) # outputs a random 10% of the rows of data
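By default, .sample() returns different rows on every call. Passing random_state fixes the seed so the same rows come back each time, which helps when results need to be reproduced. This is a minimal sketch on a toy frame; the seed value 0 is arbitrary.

```python
import pandas as pd

# Toy frame with five rows.
small = pd.DataFrame({"population": [1537.0, 809.0, 1484.0, 49.0, 850.0]})

# The same seed gives the same sample on every call.
first = small.sample(n=3, random_state=0)
second = small.sample(n=3, random_state=0)
print(first.equals(second))

# frac works the same way: 40% of 5 rows is 2 rows.
two_rows = small.sample(frac=0.4, random_state=0)
print(len(two_rows))
```

Rows are drawn without replacement unless replace=True is passed, so no row appears twice in a sample.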

## More Resources



*   Official Documentation: https://pandas.pydata.org/docs/index.html
*   Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
*   Basic Tutorial: https://www.learnpython.org/en/Pandas_Basics

