#### Importing Data
- pd.read_csv(filename, sep=delimiter, usecols=columns to load, nrows=number of rows to read, skiprows=number of rows to skip, names=column names, dtype={'colname': dtype}, na_values={'colname': na_value}, on_bad_lines='skip') | From any flat file (CSV, TSV, etc.)
    - on_bad_lines (pandas 1.3+): 'error' raises on a malformed line (the default), 'warn' skips the line and reports it, 'skip' skips it silently. It replaces error_bad_lines=False and warn_bad_lines=True, which were removed in pandas 2.0.
- pd.read_table(filename) | From a delimited text file (like TSV)
- pd.read_excel(filename, usecols='A:G,AA,ZZ', sheet_name=see below, true_values=['value to set to True'], false_values=['value to set to False'], parse_dates=['date column to be converted to datetime']) | From an Excel file
    - sheet_name: accepts a sheet's position (zero-indexed) or its name. Set sheet_name=None to load all sheets; the result is then a dictionary of DataFrames keyed by sheet name. Loop through its values to combine them into one DataFrame:
                   
                    # responses is the dictionary returned by pd.read_excel(..., sheet_name=None)
                    # Collect each sheet's DataFrame in a list
                    frames = []

                    # Set up for loop to iterate through values in responses
                    for df in responses.values():
                      # Print the number of rows being added
                      print("Adding {} rows".format(df.shape[0]))
                      frames.append(df)

                    # Combine all sheets in one step
                    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
                    all_responses = pd.concat(frames)
  
- pd.read_sql(query, connection_engine) | Read from a SQL table/database (query syntax varies by database)
    * Create the engine with SQLAlchemy before calling: http://localhost:8888/notebooks/repos/Toolkit/SQL-noSQL%20Toolkit.ipynb#SQLalchemy-and-Object-Relation-Mappers-(ORM)
- pd.read_json(json_string, orient=one of 'records', 'columns', 'split', 'index', 'values') | Read from a JSON formatted string, URL or file (on pandas 2.1+, wrap a literal JSON string in io.StringIO)
- pd.read_html(url) | Parses an HTML URL, string or file and extracts its tables into a list of DataFrames
- pd.read_clipboard() | Takes the contents of your clipboard and passes it to read_csv()
- pd.DataFrame(dict) | From a dict, keys for columns names, values for data as lists
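
As a minimal sketch of the read_csv arguments above, the snippet below parses a small in-memory CSV. The file contents and column names are made up for illustration, and `on_bad_lines` assumes pandas 1.3 or newer.

```python
import io

import pandas as pd

# Hypothetical CSV text; the line for id 3 has one field too many.
raw = io.StringIO(
    "id,city,population\n"
    "1,Lima,9751717\n"
    "2,Quito,missing\n"
    "3,Bogota,7181000,extra\n"
    "4,La Paz,816044\n"
)

cities = pd.read_csv(
    raw,
    sep=",",
    na_values={"population": ["missing"]},  # treat 'missing' as NaN in this column only
    on_bad_lines="skip",                    # drop the malformed line instead of raising
)

print(cities)
```

The malformed line is dropped and the 'missing' entry is read as NaN, so the population column is loaded as floats.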

##### loading nested JSON
- pd.json_normalize is available directly from pandas since version 1.0; older versions need to import: from pandas.io.json import json_normalize
- json_normalize(data, sep=delimiter, record_path=path to the nested records, meta=list of other attributes to include, meta_prefix=prefix added to the meta column names) | Takes a dictionary or list of dictionaries and returns a flattened DataFrame

        # deep nested example using record path and meta arguments
        flat_cafes = json_normalize(data["businesses"],
                                    sep="_",
                                    record_path="categories",
                                    meta=['name', 
                                          'alias',  
                                          'rating',
                                          ['coordinates', 'latitude'], 
                                          ['coordinates', 'longitude']],
                                    meta_prefix='biz_')
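
To see what the call above produces, here is a self-contained sketch run on a tiny, made-up `data` dictionary shaped like the one the example assumes (the business names and values are hypothetical). It uses `pd.json_normalize`, which assumes pandas 1.0 or newer.

```python
import pandas as pd

# Hypothetical nested data: each business holds a list of category records.
data = {
    "businesses": [
        {
            "name": "Cafe A",
            "alias": "cafe-a",
            "rating": 4.5,
            "coordinates": {"latitude": 40.7, "longitude": -74.0},
            "categories": [
                {"alias": "coffee", "title": "Coffee & Tea"},
                {"alias": "bakeries", "title": "Bakeries"},
            ],
        },
        {
            "name": "Cafe B",
            "alias": "cafe-b",
            "rating": 4.0,
            "coordinates": {"latitude": 40.8, "longitude": -73.9},
            "categories": [{"alias": "coffee", "title": "Coffee & Tea"}],
        },
    ]
}

# One output row per category record; the meta fields repeat for each row.
flat_cafes = pd.json_normalize(
    data["businesses"],
    sep="_",
    record_path="categories",
    meta=[
        "name",
        "alias",
        "rating",
        ["coordinates", "latitude"],
        ["coordinates", "longitude"],
    ],
    meta_prefix="biz_",
)

print(flat_cafes.columns.tolist())
```

The prefix keeps the business-level `alias` (stored as `biz_alias`) from colliding with the `alias` field of each category record.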

#### Exporting Data
- df.to_csv(filename) | Write to a CSV file
- df.to_excel(filename) | Write to an Excel file
- df.to_sql(table_name, connection_object) | Write to a SQL table
- df.to_json(filename) | Write to a file in JSON format
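
A rough round-trip sketch of the export methods, using in-memory targets so nothing is written to disk. The DataFrame contents and the `cities` table name are invented for illustration, and the SQL part assumes a plain `sqlite3` connection rather than a SQLAlchemy engine.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({"city": ["Lima", "Quito"], "population": [9751717, 2011388]})

# With no filename, to_csv returns the CSV text; index=False omits the row labels.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# to_json also returns a string when no filename is given.
json_text = df.to_json(orient="records")

# Write to a table in an in-memory SQLite database and read it back.
conn = sqlite3.connect(":memory:")
df.to_sql("cities", conn, index=False)
from_sql = pd.read_sql("SELECT city, population FROM cities", conn)
conn.close()

print(csv_text)
print(json_text)
```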

#### Create Test Objects
Useful for testing code segments

- pd.DataFrame(np.random.rand(20,5)) | 5 columns and 20 rows of random floats
- pd.Series(my_list) | Create a series from an iterable my_list
- df.index = pd.date_range('1900/1/30', periods=df.shape[0]) | Add a date index
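
Put together, the three lines above might look like the sketch below. The seeded generator and the column letters are additions for reproducibility and readability, not part of the original one-liners.

```python
import numpy as np
import pandas as pd

# Seeded generator so the "random" test data is the same on every run.
rng = np.random.default_rng(0)

# 20 rows and 5 columns of floats in [0, 1).
df = pd.DataFrame(rng.random((20, 5)), columns=list("abcde"))

# Replace the default integer index with one date per row.
df.index = pd.date_range("1900/1/30", periods=df.shape[0])

# A Series from any iterable.
my_list = [1, 2, 3]
s = pd.Series(my_list)

print(df.head())
```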

#### Viewing/Inspecting Data
- df.head(n) | First n rows of the DataFrame
- df.tail(n) | Last n rows of the DataFrame
- df.shape | Number of rows and columns
- df.info() | Index, Datatype and Memory information
- df.describe() | Summary statistics for numerical columns
- s.value_counts(dropna=False) | View unique values and counts
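- df.apply(function) | Applies a function along an axis of the DataFrame
    - axis sets the direction: 0 passes each column to the function, 1 passes each row
    - result_type (only used with axis=1): 'expand' turns list-like results into columns, 'broadcast' broadcasts results to the original shape of the DataFrame
    - args= passes additional positional arguments to the function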
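
The apply options can be sketched on a small made-up DataFrame; the column names and lambdas here are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)

# result_type='expand' spreads a list-like result over new columns.
expanded = df.apply(
    lambda row: [row["a"], row["a"] + row["b"]], axis=1, result_type="expand"
)

# args= supplies extra positional arguments after the column or row.
scaled = df.apply(lambda col, factor: col * factor, args=(2,))

print(col_range)
print(expanded)
```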

#### Selection
- df[col] | Returns column with label col as Series
- df[[col1, col2]] | Returns columns as a new DataFrame
- s.iloc[0] | Selection by position
- s.loc['index_one'] | Selection by index
- df.iloc[0,:] | First row
- df.iloc[0,0] | First element of the first column
- df.iat[0, 0] | Single value by row and column position
- df.at[0, 'Country'] | Single value by row and column label
- df.loc[2] | Select a single row by label (.ix was removed in pandas 1.0; use .loc for labels and .iloc for positions)
- df.loc[:,'Capital'] | Select a single column
- df.loc[1,'Capital'] | Select rows and columns
- s[~(s > 1)] | Series s where value is not >1
- s[(s < -1) | (s > 2)] | s where value is <-1 or >2
- df[df['Population']>1200000000] | Use filter to adjust DataFrame
- s['a'] = 6 | set index a of Series s to 6
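
A short sketch of the selection idioms on made-up data; the country rows and the Series values are illustrative. It uses `.loc` and `.iloc` throughout, which work on current pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Country": ["Belgium", "India", "Brazil"],
        "Capital": ["Brussels", "New Delhi", "Brasilia"],
        "Population": [11190846, 1303171035, 207847528],
    }
)

first_row = df.iloc[0, :]          # first row as a Series
first_cell = df.iloc[0, 0]         # row 0, column 0 by position
by_position = df.iat[0, 0]         # fast scalar access by position
by_label = df.at[0, "Country"]     # fast scalar access by label
capital = df.loc[1, "Capital"]     # row label 1, column 'Capital'
big = df[df["Population"] > 1200000000]  # boolean filter on rows

s = pd.Series([-2, 0, 3], index=["a", "b", "c"])
not_above_one = s[~(s > 1)]        # values that are not > 1
extremes = s[(s < -1) | (s > 2)]   # values < -1 or > 2
s["a"] = 6                         # assign by index label
```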

#### Data Cleaning
- df.columns = ['a','b','c'] | Rename columns
- pd.isnull(obj) | Checks for null values, returns a Boolean array (also available as df.isnull())
- pd.notnull(obj) | Opposite of pd.isnull()
- df.dropna() | Drop all rows that contain null values
- df.dropna(axis=1) | Drop all columns that contain null values
- df.dropna(axis=1,thresh=n) | Drop all columns that have fewer than n non-null values
- df.fillna(x) | Replace all null values with x
- s.fillna(s.mean()) | Replace all null values with the mean (mean can be replaced with almost any function from the Statistics section)
- s.astype(float) | Convert the datatype of the series to float
- s.replace(1,'one') | Replace all values equal to 1 with 'one'
- s.replace([1,3],['one','three']) | Replace all 1 with 'one' and 3 with 'three'
- df.rename(columns=lambda x: x + 1) | Mass renaming of columns
- df.rename(columns={'old_name': 'new_name'}) | Selective renaming
- df.set_index('column_one') | Change the index
- df.rename(index=lambda x: x + 1) | Mass renaming of index
- df.drop(['a', 'c']) | Drop rows by index label (axis=0)
- df.drop('Col_name', axis=1) | Drop a column by name (axis=1)
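
The null-handling entries are easy to mix up, so here is a small sketch on an invented DataFrame showing what each dropna variant keeps.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, 6.0],
        "c": [7.0, 8.0, 9.0],
    }
)

rows_complete = df.dropna()                # keeps only rows with no nulls
cols_complete = df.dropna(axis=1)          # keeps only columns with no nulls
cols_2plus = df.dropna(axis=1, thresh=2)   # keeps columns with at least 2 non-null values

filled = df["a"].fillna(df["a"].mean())    # the mean of the non-null values is 2.0
renamed = df.rename(columns={"a": "alpha"})
replaced = pd.Series([1, 2, 3]).replace([1, 3], ["one", "three"])
without_c = df.drop("c", axis=1)
```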

#### Filter, Sort, and Groupby
- df[df[col] > 0.5] | Rows where the column col is greater than 0.5
- df[(df[col] > 0.5) & (df[col] < 0.7)] | Rows where 0.7 > col > 0.5
- df.sort_values(col1) | Sort values by col1 in ascending order
- df.sort_values(col2,ascending=False) | Sort values by col2 in descending order
- df.sort_values([col1,col2],ascending=[True,False]) | Sort values by col1 in ascending order then col2 in descending order
- df.sort_index() | Sort by labels along an axis
- df.groupby(col) | Returns a groupby object for values from one column
- df.groupby([col1,col2]) | Returns groupby object for values from multiple columns
- df.groupby(col1)[col2].mean() | Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the Statistics section)
- df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean') | Create a pivot table that groups by col1 and calculates the mean of col2 and col3
- df.groupby(col1).agg('mean') | Find the average across all columns for every unique col1 group
- df.apply(function, axis=) | Apply the function along an axis of the DataFrame
    - axis defaults to 0 (the function receives each column); use 1 to pass each row
    - can be used with a lambda function
- df.applymap(function) | Apply function element-wise (renamed to df.map in pandas 2.1)
- df.rank() | assign ranks to entries
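
A minimal sketch of sorting, grouping, and pivoting on a made-up table; the `team`, `score`, and `bonus` columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "team": ["x", "x", "y", "y"],
        "score": [1.0, 3.0, 10.0, 20.0],
        "bonus": [0.5, 1.5, 2.0, 4.0],
    }
)

# Filter rows on a condition.
high = df[df["score"] > 2.0]

# Sort by team ascending, then by score descending within each team.
sorted_df = df.sort_values(["team", "score"], ascending=[True, False])

# Mean of one column per group.
mean_score = df.groupby("team")["score"].mean()

# Mean of every remaining column per group.
all_means = df.groupby("team").agg("mean")

# The same aggregation expressed as a pivot table.
pivot = df.pivot_table(index="team", values=["score", "bonus"], aggfunc="mean")

# Rank entries from smallest (1.0) to largest.
ranks = df["score"].rank()
```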

#### Join/Combine
- pd.concat([df1, df2]) | Add the rows of df2 to the end of df1 (columns should be identical); replaces df1.append(df2), which was removed in pandas 2.0
- pd.concat([df1, df2],axis=1) | Add the columns of df2 to the end of df1 (rows should be identical)
- df1.join(df2,on=col1,how='inner') | SQL-style join of df1 with df2, matching the values in df1's col1 against the index of df2. 'how' can be one of 'left', 'right', 'outer', 'inner'
- df1.merge(df2, on=column) or df1.merge(df2, left_on=df1 column, right_on=df2 column) | Merges DataFrames on a matching key column (keys must share a datatype; the default how='inner' keeps only keys present in both DataFrames)
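
The differences between concat, merge, and join are easiest to see side by side. The sketch below uses two tiny invented DataFrames sharing a `key` column.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Stack rows; ignore_index=True renumbers the result 0..n-1.
stacked = pd.concat([df1, df1], ignore_index=True)

# Place columns side by side, aligned on the row index.
side_by_side = pd.concat([df1, df2], axis=1)

# Inner merge (the default): only keys present in both frames survive.
merged = df1.merge(df2, on="key")

# Outer merge: keep every key, filling the gaps with NaN.
outer = df1.merge(df2, on="key", how="outer")

# join matches df1's 'key' column against the index of the other frame.
joined = df1.join(df2.set_index("key"), on="key", how="inner")
```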

#### Statistics
These can all be applied to a series as well.

- df.describe() | Summary statistics for numerical columns
    - to describe non-numeric columns add argument exclude= 'number'
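- df.mean() | Returns the mean of all columns (on pandas 2.0+, pass numeric_only=True if the DataFrame has non-numeric columns)
- df.corr() | Returns the correlation between columns in a DataFrame (same numeric_only note applies)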
- df.count() | Returns the number of non-null values in each DataFrame column
- df.max() | Returns the highest value in each column
- df.min() | Returns the lowest value in each column
- df.median() | Returns the median of each column
- df.std() | Returns the standard deviation of each column
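
A small sketch of the summary methods on a made-up DataFrame with one text column, to show how describe treats numeric and non-numeric data differently.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0],
        "label": ["a", "b", "a", "b"],
    }
)

# By default describe() summarizes only the numeric columns.
numeric_summary = df.describe()

# Excluding numbers switches to count/unique/top/freq for the remaining columns.
text_summary = df.describe(exclude="number")

# numeric_only=True skips the text column instead of raising on it.
means = df.mean(numeric_only=True)

# Correlation between the numeric columns; y is exactly 2 * x here.
corr = df[["x", "y"]].corr()

print(numeric_summary)
print(text_summary)
```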

#### Datetime

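- pd.to_datetime(arg) - converts a date string, or a column of them, to datetime
- s.dt.tz_localize('America/New_York', ambiguous='NaT') - assigns a timezone to naive datetimes; the ambiguous argument replaces ambiguous times with NaT (not a time)
- s.dt.tz_convert('Europe/London') - converts timezone-aware datetimes to the stated timezone
- s.dt.day_name() - returns the day of the week for each datetime (replaces dt.weekday_name, which was removed in pandas 1.0)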
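
A short sketch of the datetime accessors. The timestamps are chosen for illustration: 01:30 on 2017-11-05 happens twice in New York because clocks fall back that night, which makes it an ambiguous local time.

```python
import pandas as pd

times = pd.Series(
    pd.to_datetime(["2017-11-04 12:00", "2017-11-05 01:30", "2017-11-06 09:00"])
)

# Attach a timezone; the ambiguous 01:30 reading becomes NaT instead of raising.
localized = times.dt.tz_localize("America/New_York", ambiguous="NaT")

# Convert the timezone-aware values to another zone.
london = localized.dt.tz_convert("Europe/London")

# Day of the week as text.
days = times.dt.day_name()

print(localized)
print(london)
print(days)
```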

#### Iterating

- iterrows() - iterates over the rows, yielding each one as a pair of index label and Series of row values
- itertuples() - like iterrows, but yields each row as a named tuple whose fields are reached with dot notation, e.g. tuple.Index, tuple.Col1, tuple.Coln
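
Both iterators can be sketched on a two-row DataFrame; the column names and index labels are invented to mirror the `tuple.Col1` notation above.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2], "Col2": ["a", "b"]}, index=["r1", "r2"])

# iterrows yields (index label, row Series) pairs.
pairs = [(idx, row["Col1"]) for idx, row in df.iterrows()]

# itertuples yields named tuples; the index is exposed as .Index.
tuples = [(t.Index, t.Col1, t.Col2) for t in df.itertuples()]

print(pairs)
print(tuples)
```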

##### including Numpy to iterate
- pandas is built on NumPy, so operations on whole columns can be vectorized with NumPy broadcasting instead of looping over rows
- df['column'].values will return a NumPy array of that column's values (df['column'].to_numpy() does the same)
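
As a rough illustration of the point above, the sketch below computes the same per-row product twice: once on NumPy arrays in a single vectorized step, and once with an explicit loop. The `price` and `qty` columns are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# .values exposes the underlying NumPy array of a column.
prices = df["price"].values

# Element-wise multiplication over whole arrays, no Python-level loop.
total_vectorized = df["price"].values * df["qty"].values

# The equivalent row-by-row loop, shown for comparison only.
total_loop = [row.price * row.qty for row in df.itertuples()]

print(total_vectorized)
```

On DataFrames of any real size the vectorized form is typically much faster than iterating.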
