# Pandas Review

## Import a .csv file

In [None]:
import pandas as pd
publications_df = pd.read_csv('publications.csv', delimiter=',')

publications_df

## View data
- .head()
- .sample()
- .info()
- .shape
- .columns
- .describe(include="all")
- NEW!! .dtypes
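
A minimal sketch of a few of these inspection tools, run on a small made-up DataFrame (`toy_df` and its values are illustrative and are not taken from `publications.csv`). Note that `.shape`, `.columns`, and `.dtypes` are attributes, so they take no parentheses.

```python
import pandas as pd

# Hypothetical stand-in data; not from publications.csv
toy_df = pd.DataFrame({
    'journal_title': ['Nature', 'Cell', 'Science'],
    'pub_year': [2019, 2020, 2021],
})

first_rows = toy_df.head(2)          # method: first 2 rows
n_rows, n_cols = toy_df.shape        # attribute: (rows, columns)
column_names = list(toy_df.columns)  # attribute: column labels
column_types = toy_df.dtypes         # attribute: one dtype per column

print(first_rows)
print(n_rows, n_cols)
print(column_names)
print(column_types)
```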

In [None]:
publications_df

## Working with columns
### Select column
use one bracket for single column\
`name_df['column name']`\
use two brackets for multiple columns\
`name_df[['column name A', 'column name B']]`


In [None]:
publications_df['journal_title']

In [None]:
publications_df[['journal_title','journal_abbr']]

### Rename columns
`df_name.rename(columns={'old name':'new name'})`

In [None]:
publications_df.rename(columns={'journal_title':'Journal_Title'})
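
`.rename()` returns a new DataFrame and leaves the original unchanged, so assign the result to a variable if you want to keep it. A small sketch with made-up data (`toy_df` and `renamed_df` are illustrative names):

```python
import pandas as pd

# Hypothetical stand-in data; not from publications.csv
toy_df = pd.DataFrame({'journal_title': ['Nature', 'Cell']})

# rename() hands back a renamed copy
renamed_df = toy_df.rename(columns={'journal_title': 'Journal_Title'})

print(list(toy_df.columns))      # the original keeps its old column name
print(list(renamed_df.columns))  # the copy has the new column name
```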

### Drop columns
`df_name.drop(columns="column name to drop")`

To drop more than one column 

`df_name.drop(columns=["column1 to drop","column2 to drop"])`

[pandas.DataFrame.drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)

In [None]:
one_dropped_column=publications_df.drop(columns="pub_year")
one_dropped_column

In [None]:
two_dropped_columns=publications_df.drop(columns=['pub_year','authors'])
two_dropped_columns

### Add columns

`df_name['new column name'] = new_column_value`

In [None]:
publications_df['new_column']='new column value'

In [None]:
publications_df.columns

In [None]:
publications_df[['new_column']]

### Filter data

Start by writing the conditional statement:

`publications_df['journal_title'] == 'Nature communications'`

Then use the select column syntax `name_df['column name']` and substitute your conditional statement for `'column name'`:

`name_df[conditional statement]`

In [None]:
nature_communications_filter=publications_df['journal_title'] == 'Nature communications'

publications_df[nature_communications_filter]
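
The two steps can be sketched on a small made-up DataFrame (`toy_df`, `mask`, and the values are illustrative). The conditional statement produces a Series of True/False values, and only the rows marked True survive the filter.

```python
import pandas as pd

# Hypothetical stand-in data; not from publications.csv
toy_df = pd.DataFrame({
    'journal_title': ['Nature communications', 'Cell', 'Nature communications'],
    'pub_year': [2019, 2020, 2021],
})

# Step 1: the conditional statement produces a True/False Series
mask = toy_df['journal_title'] == 'Nature communications'

# Step 2: pass that Series inside the brackets to keep only the True rows
filtered_df = toy_df[mask]

print(mask.tolist())
print(filtered_df)
```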

### -----NEW!!-----
### Sort columns 
**Syntax:**\
`.sort_values(by='column name')` method.

Pandas sorts in ascending order by default. To change the sort order add the parameter `ascending=False`.


In [None]:
publications_df.sort_values(by='pub_year', ascending=False)
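
A short sketch of both sort orders on made-up data (`toy_df` and its values are illustrative). `.sort_values()` returns a sorted copy, so the original row order is untouched unless you reassign.

```python
import pandas as pd

# Hypothetical stand-in data; not from publications.csv
toy_df = pd.DataFrame({
    'journal_title': ['Cell', 'Nature', 'Science'],
    'pub_year': [2020, 2019, 2021],
})

oldest_first = toy_df.sort_values(by='pub_year')                   # ascending (default)
newest_first = toy_df.sort_values(by='pub_year', ascending=False)  # descending

print(oldest_first)
print(newest_first)
```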

### Addressing null values
Missing values display as `NaN` in pandas DataFrames.

In [None]:
publications_df.info()

#### .isna() | .notna()
Use the `.isna()` and `.notna()` methods to find null values and then filter these values from your DataFrame.

In [None]:
publications_df[publications_df['link_to_articles_citing_from_pubmed'].isna()]

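**`NaN` values can break pandas Python scripts!**

**Why?**
1. Math and `NaN` values do not mix. Arithmetic on a `NaN` returns `NaN`, and aggregations such as `.sum()` and `.mean()` skip `NaN` values by default. Assign a zero, or remove the `NaN` value from your dataset, if that is not the result you want.
2. Text methods and `NaN` values do not mix either. Pandas considers `NaN` values to be floats, not strings, so `.str` methods such as `.str.upper()` return `NaN` for those rows instead of a string.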
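
Both points can be checked on two small made-up Series (`numbers` and `titles` are illustrative names). The `na=False` argument to `.str.contains()` is one way to get a clean True/False result when a column has missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one missing value each
numbers = pd.Series([1.0, np.nan, 3.0])
titles = pd.Series(['Nature', np.nan, 'Cell'])

added = numbers + 1        # arithmetic: the NaN row stays NaN
total = numbers.sum()      # aggregation: NaN is skipped by default
upper = titles.str.upper() # text method: the NaN row stays NaN

# na=False treats missing values as "does not contain"
contains_filled = titles.str.contains('Nature', na=False)

print(added)
print(total)
print(upper)
print(contains_filled)
```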

#### .fillna()

The `.fillna()` method can replace a `NaN` with a value of your choosing.

In [None]:
publications_df['link_to_articles_citing_from_pubmed']=publications_df['link_to_articles_citing_from_pubmed'].fillna("No link available")


In [None]:
publications_df[publications_df['link_to_articles_citing_from_pubmed'].isna()]
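
A small sketch of why the result is assigned back to the column: `.fillna()` returns a new object and leaves the original alone. The `numbers` Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one missing value
numbers = pd.Series([1.0, np.nan, 3.0])

filled = numbers.fillna(0)  # returns a copy with NaN replaced by 0

print(numbers.isna().sum())  # the original still has its missing value
print(filled.tolist())
```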

### Duplicate Values

Use the `.duplicated()` method to find duplicate rows.

*Parameters*
- `subset=None`: replace `None` with a column label or sequence of labels to tell Pandas which columns to consider when identifying duplicates. If you do not identify a column, Pandas will evaluate all columns.
- `keep=False` marks every duplicated row as `True`
- `keep='first'` (the default) marks every duplicate except the first occurrence as `True`
- `keep='last'` marks every duplicate except the last occurrence as `True`
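
The three `keep` options can be compared side by side on a made-up Series (`titles` is an illustrative name) in which 'Nature' appears three times:

```python
import pandas as pd

# Hypothetical stand-in data; 'Nature' appears at positions 0, 2, and 3
titles = pd.Series(['Nature', 'Cell', 'Nature', 'Nature'])

mark_all = titles.duplicated(keep=False)      # every member of a duplicate group
skip_first = titles.duplicated(keep='first')  # all but the first occurrence
skip_last = titles.duplicated(keep='last')    # all but the last occurrence

print(mark_all.tolist())
print(skip_first.tolist())
print(skip_last.tolist())
```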

In [None]:
publications_df.duplicated(keep=False)

In [None]:
publications_df.duplicated(subset='journal_title', keep=False)

If a duplicate is present, we would see the value True in the cell output above. Use the `.duplicated()` method as a filter to isolate the duplicated rows.

In [None]:
publications_df[publications_df.duplicated(subset='journal_title',keep=False)]

#### .drop_duplicates()

The `.drop_duplicates()` method removes duplicate rows and keeps the first instance of each by default. Use the `subset` parameter to name the column or columns used to identify the duplicates.

[pandas.DataFrame.drop_duplicates](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)


In [None]:
publications_df.drop_duplicates(subset='journal_title')
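
A sketch of which row survives, using made-up data (`toy_df` and `deduped_df` are illustrative names). With the default settings, the first 'Nature' row is kept and the later one is dropped.

```python
import pandas as pd

# Hypothetical stand-in data; not from publications.csv
toy_df = pd.DataFrame({
    'journal_title': ['Nature', 'Cell', 'Nature'],
    'pub_year': [2019, 2020, 2021],
})

# Rows count as duplicates when journal_title matches; the first one is kept
deduped_df = toy_df.drop_duplicates(subset='journal_title')

print(deduped_df)
print(len(toy_df))  # the original DataFrame still has all of its rows
```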

#### .dropna()
The `.dropna()` method removes rows that contain `NaN` values.

**Syntax:**\
`df.dropna(*, axis=0, how=<no_default>, thresh=<no_default>, subset=None, inplace=False, ignore_index=False)`

[pandas.DataFrame.dropna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html#pandas.DataFrame.dropna)

In [None]:
publications_df.dropna(subset='link_to_articles_citing_from_pubmed')
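
A sketch of how `subset` changes which rows are removed, using made-up data (`toy_df` and the `link` column are illustrative). Without `subset`, a row is dropped if any column is missing. With `subset`, only the named columns are checked.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one missing value per column
toy_df = pd.DataFrame({
    'journal_title': ['Nature', 'Cell', np.nan],
    'link': ['a.org', np.nan, 'c.org'],
})

any_missing_dropped = toy_df.dropna()                  # checks every column
link_missing_dropped = toy_df.dropna(subset=['link'])  # checks only 'link'

print(any_missing_dropped)
print(link_missing_dropped)
```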

## Clean data

### String methods

| **Pandas String Method** | **Explanation**                                                                                   |
|:-------------:|:---------------------------------------------------------------------------------------------------:|
| df['column_name']`.str.lower()`         | makes the string in each row lowercase                                                                                |
| df['column_name']`.str.upper()`         | makes the string in each row uppercase                                                |
| df['column_name']`.str.title()`         | makes the string in each row titlecase                                                |
| df['column_name']`.str.replace('old string', 'new string')`      | replaces `old string` with `new string` for each row |
| df['column_name']`.str.contains('some string')`      | tests whether string in each row contains "some string" |
| df['column_name']`.str.split('delim')`          | splits the string in each row into a list of substrings at the given delimiter |
| df['column_name']`.str.join('delim')`         | opposite of split(), joins the list elements in each row into one string using the given delimiter                                                                        |
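
`.str.split()` and `.str.join()` are the least obvious entries in the table, so here is a small round trip on made-up values (the `authors` Series and the delimiters are illustrative):

```python
import pandas as pd

# Hypothetical stand-in data; not from publications.csv
authors = pd.Series(['Smith J; Lee K', 'Garcia M'])

as_lists = authors.str.split('; ')   # each row becomes a list of substrings
rejoined = as_lists.str.join(' | ')  # each row's list is joined back into one string

print(as_lists.tolist())
print(rejoined.tolist())
```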

#### .upper()

In [None]:
publications_df['journal_title'].str.upper()

#### .lower()

In [None]:
publications_df['journal_title'].str.lower()

#### .replace()

**Syntax:**\
`.replace('old string', 'new string')`

Let's remove the `https://` from each value in the **link_to_pubmed_abstract** field.

In [None]:
publications_df['link_to_pubmed_abstract']=publications_df['link_to_pubmed_abstract'].str.replace('https://','')

In [None]:
publications_df.sample(5)

#### .contains()
Isolate rows where a column contains a specific word (another way to filter).

In [None]:
publications_df[publications_df['journal_title'].str.contains('Nature')]

In [None]:
publications_df[publications_df['journal_title'].str.contains('Nature microbiology')]

## Pandas Calculations

### .describe()

In [None]:
publications_df.describe(include='all')




| Pandas calculations | Explanation                         |
|----------|-------------------------------------|
| `.count()`    | Number of non-null observations    |
| `.sum()`      | Sum of values                       |
| `.mean()`     | Mean of values                      |
| `.median()`   | Median of values         |
| `.min()`      | Minimum                             |
| `.max()`      | Maximum                             |
| `.mode()`     | Mode                                |
| `.std()`      | Unbiased standard deviation         |

### .value_counts()
Used to count how many times each unique value appears in a column.

In [None]:
publications_df['journal_title'].value_counts()
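
A sketch on a made-up Series (`titles` is an illustrative name) showing the difference between counting occurrences of each value with `.value_counts()` and counting how many distinct values exist with `.nunique()`:

```python
import pandas as pd

# Hypothetical stand-in data; not from publications.csv
titles = pd.Series(['Nature', 'Cell', 'Nature', 'Science', 'Nature'])

counts = titles.value_counts()  # occurrences per value, most frequent first
n_unique = titles.nunique()     # number of distinct values

print(counts)
print(n_unique)
```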

### groupby

To group data by specific values and then use a Pandas calculation, use the `.groupby()` method.

**Syntax:**\
`DataFrame.groupby(by=None, axis=<no_default>, level=None, as_index=True, sort=True, group_keys=True, observed=<no_default>, dropna=True)`

[pandas.DataFrame.groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby)

In [None]:
publications_df.groupby('journal_title').count()

In [None]:
publications_df.groupby('journal_title').count()['journal_abbr']
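
A sketch of grouping followed by a calculation, on made-up data (`toy_df` and the `citations` column are hypothetical and do not come from `publications.csv`). Selecting the column before the calculation limits the work to that one column.

```python
import pandas as pd

# Hypothetical stand-in data; the 'citations' column is invented for illustration
toy_df = pd.DataFrame({
    'journal_title': ['Nature', 'Cell', 'Nature'],
    'pub_year': [2019, 2020, 2021],
    'citations': [10, 5, 30],
})

# Number of rows per journal, counted on a single column
per_journal_count = toy_df.groupby('journal_title')['pub_year'].count()

# Mean of the hypothetical citations column per journal
per_journal_mean = toy_df.groupby('journal_title')['citations'].mean()

print(per_journal_count)
print(per_journal_mean)
```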