# Create a DataFrame I 
A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.

You can pass in a dictionary to pd.DataFrame(). Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error. Here’s an example:

In [2]:
import pandas as pd

df1 = pd.DataFrame(
    {
        'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
        'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
        'age': [28, 24, 19]
    }
)

print(df1)

         name         address  age
0  John Smith    123 Main St.   28
1    Jane Doe  456 Maple Ave.   24
2   Joe Schmo    789 Broadway   19
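
The same-length rule can be checked directly. The sketch below uses a hypothetical dictionary named `bad_data` (not part of the example above) whose `age` list is one item short. In that case pandas should raise a `ValueError` instead of building the DataFrame.

```python
import pandas as pd

# Hypothetical data: 'age' has two values while 'name' has three.
bad_data = {
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'age': [28, 24],
}

try:
    pd.DataFrame(bad_data)
    raised = False
except ValueError as err:
    # pandas refuses to build a DataFrame from columns of unequal length
    raised = True
    print('Could not build DataFrame:', err)
```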


# Create a DataFrame II
You can also build a DataFrame from a list of lists, where each inner list is one row. Pass the keyword argument `columns` with the list of column names; these play the same role as the dictionary keys in the previous example.

In [3]:
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '789 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])
print(df2)

         name         address  age
0  John Smith    123 Main St.   34
1    Jane Doe  456 Maple Ave.   28
2   Joe Schmo    789 Broadway   51


# Comma Separated Values (CSV)
CSV (comma separated values) is a text-only spreadsheet format. You can find CSVs in lots of places:

- Online datasets
- Export from Excel or Google Sheets
- Export from SQL

The first row of a CSV contains column headings. All subsequent rows contain values. Each column heading and each value is separated by a comma:

```txt
column1,column2,column3
value1,value2,value3
```
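
To see how the heading row maps to column names, here is a minimal sketch that parses CSV text held in a string instead of a file. The variable names `csv_text` and `people` are made up for illustration, and `io.StringIO` is used only so the example does not need a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content: one heading row followed by two rows of values.
csv_text = (
    "name,address,age\n"
    "John Smith,123 Main St.,28\n"
    "Jane Doe,456 Maple Ave.,24\n"
)

# read_csv accepts a file-like object as well as a file path.
people = pd.read_csv(io.StringIO(csv_text))

print(list(people.columns))  # the headings from the first row
print(len(people))           # the number of value rows
```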

# Loading and Saving CSVs
When we have data in a CSV, we can load it into a DataFrame in Pandas using `read_csv()`:

In [None]:
pd.read_csv('file.csv')

Here, the name of the CSV file, `file.csv`, is passed in as an argument. We can also save data to a CSV using `to_csv()`:

In [None]:
df = pd.DataFrame()
df.to_csv('file.csv')
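
A round trip makes the two methods easier to follow. The sketch below writes a small, made-up DataFrame (`clinics`) to a temporary file and reads it back. It passes `index=False` to `to_csv()`, which is an addition to the call shown above; without it, the row index is also written to the file as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data used only for this round trip.
clinics = pd.DataFrame({
    'month': ['January', 'February'],
    'visits': [100, 51],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'clinics.csv')
    clinics.to_csv(path, index=False)  # save without the row index
    reloaded = pd.read_csv(path)       # load the same file back

print(reloaded)
```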

# Inspect a DataFrame 
When we load a new DataFrame from a CSV, we want to know what it looks like. If it is small, we can simply print it. If it is very large, we can use the method `.head()` to display the first 5 rows, or pass an argument n to display the first n rows.

Another method is `.info()`, which prints a summary of each column, including its data type and its number of non-null values.

In [None]:
df = pd.read_csv('file.csv')

print(df.head()) # first 5 rows
print(df.head(10)) # first 10 rows
df.info() # data types and non-null counts; info() prints the summary itself and returns None
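
Since `file.csv` is only a placeholder, here is a self-contained sketch of the same inspection steps on a small DataFrame built in memory. The name `numbers` and its columns are invented for the example.

```python
import pandas as pd

# Hypothetical 12-row DataFrame, large enough to make head() worthwhile.
numbers = pd.DataFrame({
    'n': range(12),
    'square': [i * i for i in range(12)],
})

first_five = numbers.head()    # default: first 5 rows
first_three = numbers.head(3)  # first 3 rows

print(first_five)
print(first_three)
numbers.info()  # prints column names, non-null counts, and dtypes
```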

# Select Columns 
Now we know how to create and load data. Let's select the parts of those datasets that are interesting or important to our analyses. Suppose we have a DataFrame called `customers` with a column called `age`. Maybe we want to take the average or plot a histogram of the ages. To do either of these tasks, we first need to select the column. There are two ways to do this:

1. Select the column as if you were selecting a value from a dictionary using a key: `customers['age']`
2. If the name of a column follows all of the rules for a variable name, you can select it using the following notation: `df.MySecondColumn`
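
The two notations can be compared side by side. In this sketch the `customers` DataFrame and its `home city` column are hypothetical; the column with a space in its name shows a case where only bracket notation works.

```python
import pandas as pd

# Hypothetical customer data for illustration.
customers = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'age': [28, 24, 19],
    'home city': ['Boston', 'Austin', 'Denver'],
})

ages_bracket = customers['age']  # dictionary-style selection
ages_dot = customers.age         # attribute-style selection

print(ages_bracket.equals(ages_dot))  # both give the same Series
print(ages_bracket.mean())            # average age

# 'home city' is not a valid variable name, so brackets are required.
cities = customers['home city']
```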

In [7]:
df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

clinic_north = df['clinic_north']
print(type(clinic_north)) 
print(type(df)) 

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


# Selecting Multiple Columns 
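When you have a larger DataFrame, you might want to select just a few columns, for instance a last name and an email. To select two or more columns, pass a list of column names inside the brackets: `new_df = orders[['last_name', 'email']]`. Note the double brackets: the outer pair does the selecting and the inner pair is the list.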

In [None]:
# Following the example above 
clinic_north_south = df[['clinic_north', 'clinic_south']]
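
One detail worth checking is the type of the result. The sketch below, using a made-up `orders` DataFrame, suggests that selecting with a list always gives a DataFrame, even when the list holds a single name, while selecting with a plain string gives a Series.

```python
import pandas as pd

# Hypothetical orders data for illustration.
orders = pd.DataFrame({
    'last_name': ['Smith', 'Doe'],
    'email': ['js@example.com', 'jd@example.com'],
    'total': [20.5, 13.0],
})

contacts = orders[['last_name', 'email']]  # list of names -> DataFrame
emails_frame = orders[['email']]           # one-item list -> still a DataFrame
emails_series = orders['email']            # plain string -> Series

print(type(contacts).__name__)
print(type(emails_frame).__name__)
print(type(emails_series).__name__)
```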

# Select Rows 
DataFrames are zero-indexed, meaning that we start with the 0th row and count up from there. To select a single row, pass its position to `.iloc[]`.

In [8]:
# Select March (row 2) from the DataFrame
march = df.iloc[2]
print(march)

month           March
clinic_east        81
clinic_north       96
clinic_south       65
clinic_west        96
Name: 2, dtype: object


# Selecting Multiple Rows 
When we want to select more than one row, we can slice with `.iloc[]`:
1. `dataframe.iloc[1:n]` selects all rows starting at row 1 and up to but not including row n (for example, if n is 9, it selects rows 1 through 8).
2. `dataframe.iloc[:4]` selects all rows up to but not including row 4, that is, rows 0 through 3.
3. `dataframe.iloc[-3:]` selects the rows starting at the third-to-last row, up to and including the final row.

In [9]:
april_may_june = df.iloc[3:]
print(april_may_june)

   month  clinic_east  clinic_north  clinic_south  clinic_west
3  April           80            80            54          180
4    May           51            54            54          154
5   June          112           109            79          129
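
The three slice forms listed above can be tried on a tiny DataFrame where the results are easy to read. The `letters` DataFrame here is hypothetical and exists only to make the positions obvious.

```python
import pandas as pd

# Hypothetical 8-row DataFrame: row 0 is 'a', row 7 is 'h'.
letters = pd.DataFrame({'letter': list('abcdefgh')})

middle = letters.iloc[1:4]      # rows 1, 2, 3 (row 4 is excluded)
first_four = letters.iloc[:4]   # rows 0, 1, 2, 3
last_three = letters.iloc[-3:]  # rows 5, 6, 7

print(middle['letter'].tolist())      # ['b', 'c', 'd']
print(first_four['letter'].tolist())  # ['a', 'b', 'c', 'd']
print(last_three['letter'].tolist())  # ['f', 'g', 'h']
```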


# Select Rows with Logic I 
You can select a subset of a DataFrame by using logical statements:
`df[df.columnName == some_value]`. For example, if we want to know which customers are under 25 years old, we might use `df[df.age < 25]`.

In [10]:
january = df[df.month == 'January']
print(january)

     month  clinic_east  clinic_north  clinic_south  clinic_west
0  January          100           100            23          100
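
The under-25 case mentioned earlier can be sketched as follows, assuming a hypothetical `customers` DataFrame with an `age` column. The comparison produces a Series of True/False values, and passing that Series inside the brackets keeps only the rows marked True.

```python
import pandas as pd

# Hypothetical customer data for illustration.
customers = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'age': [28, 24, 19],
})

is_under_25 = customers.age < 25     # boolean Series, one value per row
under_25 = customers[is_under_25]    # keep only the rows where it is True

print(is_under_25.tolist())          # [False, True, True]
print(under_25['name'].tolist())     # ['Jane Doe', 'Joe Schmo']
```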


# Select Rows with Logic II 
We can also combine multiple logical statements, as long as each statement is in parentheses. Use `&` for conjunction (and) and `|` for disjunction (or).

In [11]:
march_april = df[(df.month == 'March') | (df.month == 'April')]
print(march_april)

   month  clinic_east  clinic_north  clinic_south  clinic_west
2  March           81            96            65           96
3  April           80            80            54          180
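
The example above uses `|`; here is a short sketch of `&` alongside it, on a reduced, made-up version of the clinic data called `visits`. The parentheses matter because in Python `&` and `|` bind more tightly than comparisons such as `>` and `==`, so leaving them out changes how the expression is grouped.

```python
import pandas as pd

# Hypothetical subset of the clinic data, for illustration only.
visits = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April'],
    'clinic_east': [100, 51, 81, 80],
    'clinic_west': [100, 45, 96, 180],
})

# Both conditions must hold.
busy_both = visits[(visits.clinic_east > 80) & (visits.clinic_west > 90)]

# At least one condition must hold.
busy_either = visits[(visits.clinic_east > 90) | (visits.clinic_west > 150)]

print(busy_both['month'].tolist())    # ['January', 'March']
print(busy_either['month'].tolist())  # ['January', 'April']
```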


# Select Rows with Logic III
Suppose we want to select the rows where a column's value matches any of several options. We can use `.isin()` to check whether each value in the column appears in a given list.

In [12]:
january_february_march = df[df.month.isin(
  ['January', 'February', 'March']
)]

print(january_february_march)

      month  clinic_east  clinic_north  clinic_south  clinic_west
0   January          100           100            23          100
1  February           51            45           145           45
2     March           81            96            65           96
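
One way to read `.isin()` is as shorthand for a chain of `==` comparisons joined with `|`. The sketch below compares the two forms on a small, hypothetical `visits` DataFrame; the list name `wanted` is also made up.

```python
import pandas as pd

# Hypothetical subset of the clinic data, for illustration only.
visits = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April'],
    'clinic_east': [100, 51, 81, 80],
})

wanted = ['January', 'March']

with_isin = visits[visits.month.isin(wanted)]
with_or = visits[(visits.month == 'January') | (visits.month == 'March')]

print(with_isin.equals(with_or))     # the two selections match
print(with_isin['month'].tolist())   # ['January', 'March']
```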


# Setting Indices 
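When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. This is inelegant, and because the index labels no longer match the row positions, it makes `.iloc[]` harder to use.

We can fix this using the method `.reset_index()`. By default it returns a new DataFrame and keeps the old index as a column named `index`. Pass `drop=True` to discard the old index, and `inplace=True` to modify the existing DataFrame instead of returning a new one.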

In [13]:
df2 = df.loc[[1, 3, 5]]

print(df2)

df3 = df2.reset_index()

print(df3)

df2.reset_index(inplace=True, drop=True)

print(df2)

      month  clinic_east  clinic_north  clinic_south  clinic_west
1  February           51            45           145           45
3     April           80            80            54          180
5      June          112           109            79          129
   index     month  clinic_east  clinic_north  clinic_south  clinic_west
0      1  February           51            45           145           45
1      3     April           80            80            54          180
2      5      June          112           109            79          129
      month  clinic_east  clinic_north  clinic_south  clinic_west
0  February           51            45           145           45
1     April           80            80            54          180
2      June          112           109            79          129
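
The cell above builds the gap in the index by hand with `.loc[[1, 3, 5]]`. As a complementary sketch, the code below shows the same situation arising from a logical selection, then resets the index without `inplace`, assigning the result to a new name. The `visits`, `busy`, and `busy_clean` names are hypothetical.

```python
import pandas as pd

# Hypothetical subset of the clinic data, for illustration only.
visits = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'clinic_east': [100, 51, 81, 80, 51, 112],
})

# Logical selection keeps the original labels of the matching rows.
busy = visits[visits.clinic_east > 80]
print(busy.index.tolist())        # [0, 2, 5]

# drop=True discards the old labels; the result is a new DataFrame.
busy_clean = busy.reset_index(drop=True)
print(busy_clean.index.tolist())  # [0, 1, 2]

# busy itself is unchanged because inplace was not used.
print(busy.index.tolist())        # [0, 2, 5]
```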
