<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Python_Data_Analytics_Course/blob/main/2_Advanced/03_Pandas_Data_Management.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Pandas Data Management

Load the job postings data and keep an untouched copy in `df_original`.

In [1]:
# Importing Libraries
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

# DataFrame Copy
df_original = df.copy()



## Copy

Recall from the last lesson that we filled in the missing values in `salary_year_avg` with the median salary.

Here, let's make a new DataFrame, `df_altered`, and only make changes to it.

In [3]:
# Create new dataframe
df_altered = df_original

df_altered.loc[:5,'salary_year_avg']

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
Name: salary_year_avg, dtype: float64

Let's fill in missing values with the median value.

In [4]:
# Calculating the median salary
median_salary = df_altered['salary_year_avg'].median()

# Filling the missing values with the median salary
df_altered['salary_year_avg'] = df_altered.loc[:,'salary_year_avg'].fillna(median_salary)

In [5]:
df_altered['salary_year_avg'] = df_altered['salary_year_avg'].fillna(median_salary)

Now let's inspect the altered DataFrame.

In [6]:
df_altered.loc[:5,'salary_year_avg']

0    115000.0
1    115000.0
2    115000.0
3    115000.0
4    115000.0
5    115000.0
Name: salary_year_avg, dtype: float64

That worked...

But what about the original?

In [7]:
df_original.loc[:5,'salary_year_avg']

0    115000.0
1    115000.0
2    115000.0
3    115000.0
4    115000.0
5    115000.0
Name: salary_year_avg, dtype: float64

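Hold up!! How the heck did `df_original` get altered!?

Well, assigning with `=` doesn't copy a DataFrame; it only gives the same object a second name. So the variables `df_original` and `df_altered` both reference the same DataFrame, which we can confirm by comparing their `id()` values.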

In [9]:
print('ID of df_original:               ', id(df_original))
print('ID of df_altered:                ', id(df_altered))
print('Are the two dataframes the same? ', id(df_original) == id(df_altered))

ID of df_original:                1789144651856
ID of df_altered:                 1789144651856
Are the two dataframes the same?  True
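
The same aliasing behavior can be reproduced on a tiny DataFrame. The sketch below is only an illustration: `toy_original` and `toy_alias` are hypothetical names and the values are made up, not taken from the job postings data. It uses Python's `is` operator, which checks object identity just like comparing `id()` values.

```python
import pandas as pd

# Hypothetical toy data standing in for the job postings DataFrame
toy_original = pd.DataFrame({'salary_year_avg': [float('nan'), 100000.0, 120000.0]})

# Plain assignment binds a second name to the same object; nothing is copied
toy_alias = toy_original

# `is` checks object identity, equivalent to comparing id() values
same_object = toy_alias is toy_original

# A change made through one name is visible through the other
toy_alias.loc[0, 'salary_year_avg'] = 110000.0
value_seen_via_original = toy_original.loc[0, 'salary_year_avg']

print('Same object?            ', same_object)
print('Value seen via original:', value_seen_via_original)
```

Because both names point at one object, there is no "original" left to fall back on once the change is made.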


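Instead, we can use the `.copy()` method, which returns a new, independent DataFrame.

- `copy()`: Copy a DataFrame (a deep copy by default, so the data is duplicated as well).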

In [10]:
df_original = df.copy()
df_altered = df_original.copy()

print('ID of df_original:               ', id(df_original))
print('ID of df_altered:                ', id(df_altered))
print('Are the two dataframes the same? ', id(df_original) == id(df_altered))

ID of df_original:                1789748347920
ID of df_altered:                 1789748347728
Are the two dataframes the same?  False


Now when we do this same operation:

In [None]:
# Calculating the median salary
median_salary = df_altered['salary_year_avg'].median()

# Filling the missing values with the median salary
df_altered['salary_year_avg'] = df_altered['salary_year_avg'].fillna(median_salary)

df_altered.loc[:5,'salary_year_avg']

0    115000.0
1    115000.0
2    115000.0
3    115000.0
4    115000.0
5    115000.0
Name: salary_year_avg, dtype: float64

The original dataframe doesn't get altered!

In [None]:
df_original.loc[:5,'salary_year_avg']

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
Name: salary_year_avg, dtype: float64
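
Looking at the first six rows is a quick spot check. To check a whole column, one option is to count missing values in both DataFrames. The sketch below does this on hypothetical toy data (`toy_original`, `toy_altered`, and the salary values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data with one missing salary
toy_original = pd.DataFrame({'salary_year_avg': [float('nan'), 100000.0, 120000.0]})

# .copy() is deep by default, so the copy owns its own data
toy_altered = toy_original.copy()

# Same fill-with-median step as in the lesson, applied to the copy only
median_salary = toy_altered['salary_year_avg'].median()
toy_altered['salary_year_avg'] = toy_altered['salary_year_avg'].fillna(median_salary)

# Count missing values in each DataFrame
missing_in_original = int(toy_original['salary_year_avg'].isna().sum())
missing_in_altered = int(toy_altered['salary_year_avg'].isna().sum())

print('Missing in original:', missing_in_original)
print('Missing in altered: ', missing_in_altered)
```

If the copy is truly independent, the original still reports its missing values while the altered copy reports none.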

Now that we've created a copy of our data, we can start our analysis. With a large dataset, it's often more manageable to work with a subset first. We can use `sample()` to get a random sample of the data.

## Sample

### Notes

* `sample()`: Return a random sample of rows (or of columns, with `axis=1`).

### Examples

Let's get a random sample of the data. You can ask for a fixed number of rows with `n`.

In [11]:
df.sample(n=5)

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
51077,Data Engineer,Principal Data Engineer,Anywhere,via LinkedIn,Full-time,True,India,2023-06-21 06:13:06,False,False,India,,,,Skillsoft,"['python', 'r', 'sql', 'sql server', 'azure', ...","{'cloud': ['azure'], 'databases': ['sql server..."
163840,Data Engineer,Interim Data Engineer | SaaS,"London, UK",via MyArklaMiss Jobs,Full-time,False,United Kingdom,2023-07-31 16:01:20,True,False,United Kingdom,,,,Agora Talent,['excel'],{'analyst_tools': ['excel']}
391134,Data Engineer,Sql / Bi Entwickler (m/w/d) - It Data Engineer,"Freiburg im Breisgau, Germany",via My ArkLaMiss Jobs,Full-time,False,Germany,2023-02-16 18:15:09,True,False,Germany,,,,Ratbacher GmbH,['sql'],{'programming': ['sql']}
594444,Data Engineer,Data Engineer (3640 USD/Mes),Anywhere,via LinkedIn,Full-time,True,Mexico,2023-10-31 09:09:04,True,False,Mexico,,,,Listopro,"['sql', 'python', 'r', 'redshift', 'tableau', ...","{'analyst_tools': ['tableau', 'excel'], 'cloud..."
313443,Data Scientist,Techno Functional Analyst,Singapore,via Jooble,Full-time,False,Singapore,2023-01-10 17:42:12,True,False,Singapore,,,,Tata Consultancy Services Asia Pacific Pte. Ltd.,"['python', 'sql', 'nosql', 'r', 'scala', 'sas'...","{'analyst_tools': ['sas', 'tableau', 'cognos']..."
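
Each call to `sample()` draws different rows, so results change from run to run. If a repeatable sample is needed, `sample()` accepts a `random_state` seed. The sketch below uses a hypothetical `toy_jobs` DataFrame with made-up values, and the seed `42` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data: 100 rows with made-up salaries
toy_jobs = pd.DataFrame({
    'job_id': range(100),
    'salary_year_avg': [50000.0 + 1000 * i for i in range(100)],
})

# The same seed gives the same rows on every call
sample_a = toy_jobs.sample(n=5, random_state=42)
sample_b = toy_jobs.sample(n=5, random_state=42)

# equals() compares values and index labels
same_rows = sample_a.equals(sample_b)

print(sample_a)
print('Same rows both times?', same_rows)
```

Fixing the seed is mainly useful when results need to be shared or re-run; leaving it out keeps each draw random.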


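Or you can randomly select a fraction of the rows with `frac` (e.g., 10% of the rows). With `replace=False` (the default), no row is drawn more than once; with `replace=True`, the same row can appear multiple times.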

In [13]:
df.sample(frac=0.1, replace=False)

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
198572,Data Analyst,Alternance - Chargé Data Analyst / Business In...,"Issy-les-Moulineaux, France",via TalentDetection,Full-time,False,France,2023-07-07 15:31:19,False,False,France,,,,La Banque Postale Assurances Iard,['power bi'],{'analyst_tools': ['power bi']}
497082,Data Scientist,Data scientist experimenté H/F,France,via BeBee,Full-time,False,France,2023-08-24 11:35:30,False,False,France,,,,Pôle Emploi,,
320530,Machine Learning Engineer,Machine Learning Engineer,Anywhere,via LinkedIn,Full-time,True,Argentina,2023-06-06 17:31:16,False,False,Argentina,,,,Darwoft,"['python', 'sql', 'pandas', 'pytorch', 'tensor...","{'libraries': ['pandas', 'pytorch', 'tensorflo..."
191657,Data Scientist,Data scientist CDD F/H,"Bois-Colombes, France",via LinkedIn,Full-time,False,France,2023-01-12 15:16:16,False,False,France,,,,Abeille Assurances,"['sas', 'sas', 'python', 'sql']","{'analyst_tools': ['sas'], 'programming': ['sa..."
31269,Data Analyst,Data Analyst,Hong Kong,via LinkedIn,Full-time,False,Hong Kong,2023-05-08 13:20:06,False,False,Hong Kong,,,,Hong Kong Technology Venture Company Ltd,"['sql', 'python', 'tableau']","{'analyst_tools': ['tableau'], 'programming': ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
485446,Data Engineer,Data Engineer,"Essendon Fields VIC, Australia",via Trabajo.org,Full-time,False,Australia,2023-05-09 15:01:57,True,False,Australia,,,,LINFOX,"['aws', 'snowflake']","{'cloud': ['aws', 'snowflake']}"
439372,Data Analyst,Data Analyst with Data analytics and Visualiza...,"Irving, TX",via Dice,Contractor,False,"Texas, United States",2023-03-15 16:01:37,False,False,United States,,,,"Techno-Comp, Inc.","['python', 'sql', 'nosql']","{'programming': ['python', 'sql', 'nosql']}"
675415,Data Analyst,Stage - Assistant Data Analyst Media H/F,"Issy-les-Moulineaux, France",via LinkedIn,Full-time,False,France,2023-08-08 10:20:18,False,False,France,,,,Nestlé,"['python', 'sql', 'databricks']","{'cloud': ['databricks'], 'programming': ['pyt..."
645905,Data Engineer,Azure Data Engineer,Anywhere,via LinkedIn,Full-time,True,Lithuania,2023-03-16 12:58:03,True,False,Lithuania,,,,Nortal,"['sql', 'sql server', 'azure', 'power bi']","{'analyst_tools': ['power bi'], 'cloud': ['azu..."
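
To make the difference between sampling with and without replacement concrete, here is a minimal sketch on a hypothetical 50-row `toy_jobs` DataFrame (the data and the `random_state=0` seed are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data: 50 rows
toy_jobs = pd.DataFrame({'job_id': range(50)})

# 10% of the rows without replacement: every sampled row is distinct
tenth = toy_jobs.sample(frac=0.1, replace=False, random_state=0)

# Same size as the data, with replacement: rows may repeat
resampled = toy_jobs.sample(frac=1.0, replace=True, random_state=0)

print('Rows in 10% sample:       ', len(tenth))
print('10% sample index unique?  ', tenth.index.is_unique)
print('Rows in resample:         ', len(resampled))
print('Distinct rows in resample:', resampled.index.nunique())
```

Sampling with replacement is what allows a sample as large as (or larger than) the data itself; without replacement, `frac` cannot exceed 1.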


Now you can analyze these subsets of data.