In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Pandas and Excel:
Pandas treats an Excel workbook as a `dictionary` where:

1. `key = sheet name`

2. `value = DataFrame`
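
A minimal sketch of this mapping, assuming `openpyxl` is installed as the Excel engine. The two tiny sheets and their values are made up for illustration, and they are written to an in-memory buffer rather than to the survey file.

```python
import io

import pandas as pd

# Build a small two-sheet workbook in memory (illustrative data only).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"Age": [27, 34]}).to_excel(writer, sheet_name="2016", index=False)
    pd.DataFrame({"Age": [21]}).to_excel(writer, sheet_name="2017", index=False)
buffer.seek(0)

# sheet_name=None asks for every sheet; the result maps sheet name -> DataFrame.
workbook = pd.read_excel(buffer, sheet_name=None)

print(list(workbook))            # the keys are the sheet names
print(type(workbook["2017"]))    # each value is a DataFrame
```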

In [2]:
## Before importing the Excel file, you can list its sheet names

pd.ExcelFile('fcc-new-coder-survey.xlsx').sheet_names

['2016', '2017']

### EX 1:

In this exercise, you'll create a data frame from the "base case" of Excel import: a `single sheet` of tabular data. The `fcc-new-coder-survey.xlsx` file here has a sample of responses from `FreeCodeCamp's` annual New Developer Survey. This survey asks participants about their demographics, education, work and home life, plus questions about how they're learning to code. Let's load all of it.

In [3]:
# Read spreadsheet and assign it to survey_responses
survey_responses = pd.read_excel('fcc-new-coder-survey.xlsx')

In [None]:
# View the head of the data frame
#survey_responses.head()

### Ex 2: Load a portion of a spreadsheet

`Spreadsheets` meant to be read by people often have `multiple tables`, e.g., a small business might keep an inventory workbook with tables for different product types on a `single sheet`. Even tabular data may have `header rows` of `metadata`, like the `New Developer Survey` data here. While the metadata is useful, we don't want it in a data frame.

You'll use read_excel()'s `skiprows` keyword to get just the data. You'll also create a `string` to pass to `usecols` to get only columns `AD and AW through BA`, about future job goals.
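
To see which positions a `usecols` string such as `"AD, AW:BA"` refers to, the sketch below converts Excel column letters to zero-based positions. The helpers `excel_col_to_index` and `expand_usecols` are hypothetical names written for this illustration; they mirror the letter-range idea and are not pandas's own implementation.

```python
def excel_col_to_index(letters: str) -> int:
    """Convert an Excel column label such as 'AD' to a zero-based position."""
    index = 0
    for ch in letters.strip().upper():
        # Column labels are base-26 numbers with digits A=1 ... Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def expand_usecols(spec: str) -> list:
    """Expand a spec like 'AD, AW:BA' into a list of zero-based positions."""
    positions = []
    for part in spec.split(","):
        if ":" in part:
            start, end = part.split(":")
            # Ranges are inclusive on both ends.
            positions.extend(
                range(excel_col_to_index(start), excel_col_to_index(end) + 1)
            )
        else:
            positions.append(excel_col_to_index(part))
    return positions


print(expand_usecols("AD, AW:BA"))  # [29, 48, 49, 50, 51, 52]
```

So the string selects six columns: one single column plus a block of five.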

In [6]:
## Opening the spreadsheet in Excel shows that the first two rows are metadata
## Load fcc-new-coder-survey.xlsx, setting skiprows to skip the first two rows of metadata

survey_responses = pd.read_excel("fcc-new-coder-survey.xlsx", skiprows = 2)
#survey_responses.head()

In [None]:
# Create a single string, specifying pandas usecols argument to get only columns AD and AW through BA, about future job goals. 

col_string = "AD, AW:BA"

survey_responses = pd.read_excel("fcc-new-coder-survey.xlsx", skiprows = 2, usecols = col_string)
#survey_responses.head()

### Getting data from multiple worksheets

An Excel workbook may contain `multiple sheets` of related data. The New Developer Survey response workbook has one sheet per year: `2016` and `2017`.

### Ex 3: Select a single sheet

Because `read_excel()` loads only the first sheet by `default`, you've already gotten survey responses for 2016. Now, you'll create a data frame of 2017 responses using read_excel()'s `sheet_name` argument in a couple different ways.

1. Create a data frame from the second workbook sheet by passing the sheet's `position` to sheet_name.

2. Create a data frame from the 2017 sheet by providing the sheet's `name` to read_excel().

In [9]:
# Create df from second worksheet by referencing its position

responses_2017 = pd.read_excel("fcc-new-coder-survey.xlsx",sheet_name = 1, skiprows = 2)
responses_2017.head()

Unnamed: 0,Age,AttendedBootcamp,BootcampFinish,BootcampLoanYesNo,BootcampName,BootcampRecommend,ChildrenNumber,CityPopulation,CodeEventConferences,CodeEventDjangoGirls,...,ResourcePluralSight,ResourceSkillCrush,ResourceStackOverflow,ResourceTreehouse,ResourceUdacity,ResourceUdemy,ResourceW3Schools,SchoolDegree,SchoolMajor,StudentDebtOwe
0,27.0,0.0,,,,,,more than 1 million,,,...,,,,,,1.0,1.0,"some college credit, no degree",,
1,34.0,0.0,,,,,,"less than 100,000",,,...,,,1.0,,,1.0,1.0,"some college credit, no degree",,
2,21.0,0.0,,,,,,more than 1 million,,,...,,,,,1.0,1.0,,high school diploma or equivalent (GED),,
3,26.0,0.0,,,,,,"between 100,000 and 1 million",,,...,,,1.0,,,,,"some college credit, no degree",,
4,20.0,0.0,,,,,,"between 100,000 and 1 million",,,...,,,1.0,,,,,bachelor's degree,Information Technology,


In [None]:
# Create a data frame from the 2017 sheet by providing the sheet's name to read_excel().

responses_2017 = pd.read_excel("fcc-new-coder-survey.xlsx",sheet_name = "2017", skiprows = 2)
#responses_2017.head()

In [None]:
responses_2017["HasDebt"].isnull().sum()

###  Select multiple sheets

So far, you've read Excel files `one sheet` at a time, which lets you customize import arguments for each sheet. But if an Excel file has some sheets that you want loaded with the same parameters, you can get them in one go by passing a `list` of their `names` or `indices` to read_excel()'s `sheet_name` keyword. To get them all, pass `None`. You'll practice both methods to get data from fcc-new-coder-survey.xlsx, which has multiple sheets of similarly-formatted data.

Workbooks meant primarily for human readers, not machines, may store data about a single subject across `multiple` sheets. For example, a file may have a different sheet of transactions for each region or year in which a business operated.

The FreeCodeCamp New Developer Survey file is set up similarly, with samples of responses from different years in different sheets. Your task here is to compile them in one data frame for analysis.

After Ex 4, all sheets will have been read into the dictionary `all_data`, where `sheet names` are `keys` and `data frames` are `values`, so you can get the data frames with the `values()` method.

### Ex 4:

1. Load both the 2016 and 2017 sheets by `name` with a `list` and one call to read_excel().

2. Load the 2016 sheet by its `position` (0) and 2017 by `name`. Note the sheet names in the result.

3. Load all sheets in the Excel file `without` listing them all.

In [None]:
## The cells below load the same data in different ways; only the dictionary keys differ

In [11]:
## Load both the 2016 and 2017 sheets by name with a list and one call to read_excel().

all_data = pd.read_excel("fcc-new-coder-survey.xlsx", skiprows = 2, sheet_name = ["2016", "2017"])


In [None]:
## Load the 2016 sheet by its position (0) and 2017 by name. Note the sheet names in the result.

all_data = pd.read_excel("fcc-new-coder-survey.xlsx", skiprows = 2, sheet_name = [0, "2017"])


In [12]:
## Load all sheets in the Excel file without listing them all.

all_data = pd.read_excel("fcc-new-coder-survey.xlsx", skiprows = 2, sheet_name = None)

In [13]:
## We can confirm that our all_data is a dictionary

type(all_data)

dict

In [None]:
# View the sheet names in all_data

all_data.keys()

In [None]:
# View all the data frames 
all_data.values()

### Ex 5: 

To combine the data frames into one, `iterate` through `all_data.values()` with the following steps:

1. Create an `empty` data frame, give it a variable name.

2. Set up a for loop to `iterate` through the `values` in the `all_data` dictionary.

3. `Append` each data frame to the empty dataframe and reassign the result to the same variable name.

In [None]:
# Create an empty data frame, give it a variable name.

all_dataframes = pd.DataFrame()

# Set up a for loop to iterate through the values in the all_data dictionary.

for df in all_data.values():
    
    # Print the number of rows being added
    print("Adding {} rows".format(df.shape[0]))
    
    # Concatenate each data frame onto the running result and reassign it to the same variable name.
    # (DataFrame.append() was removed in pandas 2.0; pd.concat() is its replacement.)
    all_dataframes = pd.concat([all_dataframes, df])

    

In [None]:
all_dataframes.head()
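
As a possible alternative to the loop, `pd.concat()` can take the whole dictionary at once and record which sheet each row came from. This is only a sketch: the `sheets` dictionary stands in for `all_data`, and its values are made up for illustration.

```python
import pandas as pd

# Stand-in for the dictionary returned by read_excel(..., sheet_name=None);
# the rows are invented for this example.
sheets = {
    "2016": pd.DataFrame({"Age": [27.0, 34.0], "HasDebt": [1.0, 0.0]}),
    "2017": pd.DataFrame({"Age": [21.0], "HasDebt": [0.0]}),
}

# Passing a dictionary to concat uses its keys as an outer index level.
combined = pd.concat(sheets, names=["sheet", "row"])

# Move the sheet name into an ordinary column and renumber the rows.
combined = combined.reset_index(level="sheet").reset_index(drop=True)

print(combined)
```

Keeping the sheet name as a column makes it possible to tell the 2016 and 2017 responses apart after they are stacked.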

In [None]:
## Creating a dataset for our next example

In [None]:
survey_responses.loc[survey_responses["HasDebt"].isnull()]

In [None]:
survey_responses['AttendedBootcamp'].isnull().sum()

In [None]:
survey_responses_new = survey_responses.dropna(subset = ["HasDebt", 'AttendedBootcamp'])

In [None]:
fcc_survey_subset = survey_responses_new[['ID.x', 'AttendedBootcamp', 'HasDebt', 'HasFinancialDependents', 'HasHomeMortgage', 'HasStudentDebt']]
fcc_survey_subset

In [None]:
# determining the name of the file
file_name = 'fcc_survey_subset.xlsx'
  
# saving the excel file
fcc_survey_subset.to_excel(file_name, index = False)
print('DataFrame is written to Excel File successfully.')

### Set Boolean columns

Datasets may have columns that are most accurately modeled as `Boolean` values. However, pandas usually loads these as `floats` by `default`, since defaulting to `Booleans` may have undesired effects like turning `NA` values into `Trues`. `fcc_survey_subset.xlsx` contains a string `ID` column and several `True/False` columns indicating `financial stressors`. 
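
The effect of casting a column with missing values to `bool` can be seen on a small in-memory Series. The values below are made up; the column name `HasDebt` is borrowed from the survey only for illustration.

```python
import numpy as np
import pandas as pd

# A 0/1 column with one missing answer, as pandas would load it (floats).
raw = pd.Series([1.0, 0.0, np.nan], name="HasDebt")

# Casting to bool silently turns the missing value into True.
as_bool = raw.astype(bool)

print(as_bool.tolist())                        # [True, False, True]
print(raw.isna().sum(), as_bool.isna().sum())  # 1 0
```

This is why it is worth counting NAs before setting a column's dtype to `bool`.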

In [None]:
# Load the data
survey_data = pd.read_excel("fcc_survey_subset.xlsx")
survey_data.head()

In [None]:
survey_data.info()

In [None]:
## Count NA values in each column of survey_data with isna() and sum()
## Note which columns besides ID.x, if any, have zero NAs.

survey_data.isnull().sum()

In [None]:
## The HasDebt and AttendedBootcamp columns have zero NAs

# Set read_excel()'s dtype argument to load the HasDebt and AttendedBootcamp column as Boolean data.

survey_data = pd.read_excel("fcc_survey_subset.xlsx", dtype = {"HasDebt": bool, "AttendedBootcamp": bool})
survey_data.head()

### Set custom true/false values

In Boolean columns, pandas automatically `recognizes` certain values, like `"TRUE" and 1`, as `True`, and others, like `"FALSE" and 0`, as `False`. Some datasets, like survey data, can use `unrecognized` values, such as `"Yes" and "No"`. Unrecognized values in a Boolean column are also changed to `True`.

For practice purposes, some Boolean columns in the New Developer Survey have been coded this way. 

Use `read_excel()`'s `true_values` argument to set custom `True` values. Use `false_values` to set custom `False` values. Each takes a `list` of values to treat as `True / False`, respectively. Custom True / False values are only `applied` to columns set as `Boolean`.
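
A small sketch of the two keywords in action. To keep it self-contained, an in-memory CSV stands in for the workbook: `read_csv()` accepts the same `true_values` and `false_values` keywords. The IDs and answers are made up for illustration.

```python
import io

import pandas as pd

# Illustrative data with Yes/No answers instead of recognized Boolean values.
csv_text = (
    "ID.x,HasDebt,AttendedBootcamp\n"
    "a1,Yes,No\n"
    "b2,No,No\n"
    "c3,Yes,Yes\n"
)

# Treat "Yes" as True and "No" as False while parsing.
survey = pd.read_csv(
    io.StringIO(csv_text),
    true_values=["Yes"],
    false_values=["No"],
)

print(survey.dtypes)
print(survey)
```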

In [None]:
# Load file with Yes as a True value and No as a False value
survey_subset = pd.read_excel("fcc_survey_subset.xlsx",
                              dtype={"HasDebt": bool,
                              "AttendedBootcamp": bool},
                              true_values = ["Yes"],
                              false_values = ["No"])

# View the data
survey_subset.head()

### Modifying imports: Parsing dates

When timestamps are stored as text, pandas does not infer that the column contains datetime data; it interprets the values as object or string data unless told otherwise. Correctly modeling datetimes is easy when they are in a standard format: we can use the parse_dates argument to tell read_excel() to read columns as datetime data.
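
The difference between text and datetime data can be illustrated in memory. This sketch uses `pd.to_datetime()` as a stand-in for what `parse_dates` does at import time; the two timestamps are made up.

```python
import pandas as pd

# Timestamps in a standard format, held as plain text.
raw = pd.Series(["2016-03-29 21:23:13", "2016-03-29 21:24:59"])

# Standard formats are recognized without a format string.
parsed = pd.to_datetime(raw)

print(pd.api.types.is_datetime64_any_dtype(raw))     # False
print(pd.api.types.is_datetime64_any_dtype(parsed))  # True
print(parsed.dt.hour.tolist())                       # [21, 21]
```

Once the column is a datetime, the `.dt` accessor exposes parts such as the hour, which plain text does not support.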

### Ex 6:

The New Developer Survey responses contain some columns with easy-to-parse timestamps. In this exercise, you'll make sure they're the right data type.

1. Load `fcc-new-coder-survey.xlsx`, making sure that the `Part1StartTime` column is parsed as `datetime` data.

2. View the first few values of `survey_responses.Part1StartTime` to make sure it contains `datetimes`.

In [None]:
# Read spreadsheet and assign it to survey_responses
survey_responses = pd.read_excel('fcc-new-coder-survey.xlsx', skiprows = 2)
#survey_responses.head()

In [None]:
survey_responses["Part1StartTime"].head()

In [None]:
## These dates are read as objects
## We need to parse them as datetime data

survey_responses = pd.read_excel('fcc-new-coder-survey.xlsx', skiprows = 2, parse_dates = ["Part1StartTime"])

In [None]:
# View the first few values of survey_responses.Part1StartTime to make sure it contains datetimes.
survey_responses["Part1StartTime"].head()

### Get datetimes from multiple columns

Sometimes, datetime data is `split` across columns. A `dataset` might have a `date` and a `time` column, or a `date` may be split into `year, month, and day` columns.

### Ex 7:

A column in the `datetime` dataset has been split so that dates are in one column, `Part2StartDate`, and times are in another, `Part2StartTime`. Your task is to use read_excel()'s parse_dates argument to combine them into one datetime column with a new name.

1. Create a dictionary, `datetime_cols`, indicating that the new column `Part2Start` should consist of `Part2StartDate` and `Part2StartTime`.

2. Load the `datetime` Excel file, supplying the dictionary to the `parse_dates` argument to create a new `Part2Start` column.


In [None]:
## Loading the excel file

survey_dates = pd.read_excel("datetime.xlsx")
survey_dates

In [None]:
survey_dates.info()

In [None]:
## All of the date columns are objects here; we need to change them to datetime objects

## But first we need to:
# Create a dictionary indicating that the new column Part2Start should consist of Part2StartDate and Part2StartTime

datetime_cols = {"Part2Start": ["Part2StartDate", "Part2StartTime"]}

In [None]:
## Load the datetime Excel file, supplying parse_dates with
## Part1StartTime and Part1EndTime to parse in place, plus a nested list
## that combines Part2StartDate and Part2StartTime into one new column


survey_date = pd.read_excel("datetime.xlsx", parse_dates = ["Part1StartTime", "Part1EndTime", 
                                                            ['Part2StartDate', 'Part2StartTime']])
survey_date

In [None]:
survey_date.info()

In [None]:
survey_date["Part2StartTime"] = survey_date["Part2StartDate_Part2StartTime"]

In [None]:
survey_date= survey_date.drop("Part2StartDate_Part2StartTime", axis = 1)

In [None]:
survey_date.info()
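
The same combination can also be done after import, which may be useful on pandas versions where combining columns through `parse_dates` is deprecated or unavailable. This is a sketch under that assumption; the sample values and the `parts` frame are made up for illustration.

```python
import pandas as pd

# Date and time held in separate text columns (illustrative values).
df = pd.DataFrame(
    {
        "Part2StartDate": ["2016-03-29", "2016-03-29"],
        "Part2StartTime": ["21:24:57", "21:27:25"],
    }
)

# Join the two strings with a space, then parse the result.
df["Part2Start"] = pd.to_datetime(df["Part2StartDate"] + " " + df["Part2StartTime"])
print(df["Part2Start"])

# A date split into year, month, and day columns can be assembled directly.
parts = pd.DataFrame({"year": [1999, 2000], "month": [12, 1], "day": [31, 1]})
assembled = pd.to_datetime(parts[["year", "month", "day"]])
print(assembled)
```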

### Parse non-standard date formats

So far, you've parsed dates that pandas could interpret `automatically`. But if a date is in a `non-standard` format, like `19991231` for `December 31, 1999`, it can't be parsed at the import stage. Instead, use `pd.to_datetime()` to convert strings to dates after import.

The New Developer Survey data has been loaded as `survey_date` but contains an unparsed datetime field. We'll use to_datetime() to convert it, passing in the column to convert and a string representing the date format used.

For more on date format codes, see this reference: https://strftime.org/. Some common codes are `year (%Y), month (%m), day (%d), hour (%H), minute (%M), and second (%S)`.
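
A short sketch of how format codes line up with the text being parsed. The sample strings are made up; the second format string is the one used for `Part2EndTime` later in this notebook.

```python
import pandas as pd

# Compact dates such as 19991231: four-digit year, month, day with no separators.
compact = pd.Series(["19991231", "20000101"])
compact_parsed = pd.to_datetime(compact, format="%Y%m%d")
print(compact_parsed)

# Month, day, year run together, followed by a space and a clock time.
stamped = pd.Series(["03292016 21:24:57", "03292016 21:27:25"])
stamped_parsed = pd.to_datetime(stamped, format="%m%d%Y %H:%M:%S")
print(stamped_parsed)
```

Each code consumes one piece of the string in order, so the format must match the text character for character, including the space.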

### Ex 8:

1. Examine `survey_date's` `Part2EndTime` column to see the `data type` and `date format`. Choose the code that describes the date format in Part2EndTime.

2. `Parse` Part2EndTime using `pd.to_datetime()`, the format keyword argument, and the `format` string you just identified. Assign the result back to the `Part2EndTime` column.

3. Print the head of Part2EndTime to confirm the column now contains datetime values.

In [None]:
# Examine survey_date's Part2EndTime column to see the data type and date format.
# Choose the code that describes the date format in Part2EndTime.

survey_date["Part2EndTime"].head()

In [None]:
## The date format is --- "%m%d%Y %H:%M:%S"

In [None]:
# Parse Part2EndTime using pd.to_datetime(), 
# the format keyword argument, and the format string you just identified. Assign the result back to the Part2EndTime column.

survey_date["Part2EndTime"] = pd.to_datetime(survey_date["Part2EndTime"], format = "%m%d%Y %H:%M:%S")

# Print the head of Part2EndTime to confirm the column now contains datetime values.

survey_date["Part2EndTime"].head()