In [1]:
# Run this cell first: it defines path_data, the datasets folder used by every import below
import os
cwd = os.path.dirname(os.getcwd())+'/'
path_data = os.path.join(os.path.dirname(os.getcwd()), 'datasets/')


In [13]:
# exercise 01

"""
Import stock listing info from the NASDAQ

In this video, you learned how to use the pd.read_csv() function to import data from a csv file containing companies listed on the AmEx Stock Exchange into a pandas DataFrame. You can apply this same knowledge to import listing information in csv files from other stock exchanges.

The next step is to ensure that the contents of the DataFrame accurately reflect the meaning of your data. Two essential methods to understand your data are .head(), which displays the first five rows of your data by default, and .info(), which summarizes elements of a DataFrame such as content, data types, and missing values.

In this exercise, you will read the file nasdaq-listings.csv with data on companies listed on the NASDAQ and then diagnose issues with the imported data. You will fix these issues in the next exercise.
"""

# Instructions

"""

    Load pandas as pd.
    Use pd.read_csv() to load the file nasdaq-listings.csv into the variable nasdaq.
    Use .head() to display the first 10 rows of the data. Which data type would you expect pandas to assign to each column? What symbol is used to represent a missing value?
    Use .info() to identify dtype mismatches in the DataFrame summary. Specifically, are there any columns that should have a more appropriate type?

"""

# solution

# Import pandas library
import pandas as pd

# Import the data
nasdaq = pd.read_csv(path_data+'nasdaq-listings.csv')

# Display first 10 rows
print(nasdaq.head(10))

# Inspect nasdaq
nasdaq.info()

#----------------------------------#

# Conclusion

"""
Great! Note that a symbol other than np.nan is used to indicate missing values, and all columns are of type object or float64.
"""

  Stock Symbol           Company Name  Last Sale  Market Capitalization  \
0         AAPL             Apple Inc.     141.05           7.400000e+11   
1        GOOGL          Alphabet Inc.     840.18           5.810000e+11   
2         GOOG          Alphabet Inc.     823.56           5.690000e+11   
3         MSFT  Microsoft Corporation      64.95           5.020000e+11   
4         AMZN       Amazon.com, Inc.     884.67           4.220000e+11   
5           FB         Facebook, Inc.     139.39           4.030000e+11   
6        CMCSA    Comcast Corporation      37.14           1.760000e+11   
7         INTC      Intel Corporation      35.25           1.660000e+11   
8         CSCO    Cisco Systems, Inc.      32.42           1.620000e+11   
9         AMGN             Amgen Inc.     161.61           1.190000e+11   

  IPO Year             Sector  \
0     1980         Technology   
1      NAN         Technology   
2     2004         Technology   
3     1986         Technology   
4     199
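
The diagnosis described in this exercise can be reproduced without the course file. The sketch below is a minimal illustration, assuming a hypothetical three-row sample that mimics the layout of nasdaq-listings.csv (the `csv_text` contents and variable names are made up for the example). It shows why a column containing the string 'NAN' is not read as numeric and why the date column stays as text.

```python
import io

import pandas as pd

# Hypothetical sample that imitates the layout of nasdaq-listings.csv
csv_text = """Stock Symbol,Last Sale,IPO Year,Last Update
AAPL,141.05,1980,4/26/17
GOOGL,840.18,NAN,4/24/17
GOOG,823.56,2004,4/23/17
"""

sample = pd.read_csv(io.StringIO(csv_text))

# 'NAN' is not one of the strings pandas treats as missing by default,
# so the whole IPO Year column is kept as text instead of numbers
ipo_is_numeric = pd.api.types.is_numeric_dtype(sample["IPO Year"])
missing_count = int(sample["IPO Year"].isna().sum())

# Without parse_dates, the Last Update column is not converted to datetimes
update_is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Last Update"])

print(ipo_is_numeric, missing_count, update_is_datetime)  # False 0 False
```

Checking dtypes programmatically like this gives the same information as reading the .info() summary by eye.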


In [3]:
# exercise 02

"""
How to fix the data import?

Two optional arguments that you can add to .read_csv() to better represent the data from an external file are:

    na_values: Converts a given string to np.nan, defaults to None
    parse_dates: Reads the data in a list of given columns as dtype datetime64, defaults to False

Which of the following steps should you take to make sure that the data imported from nasdaq-listings.csv are accurately represented?

The nasdaq DataFrame that you created in the previous exercise is available in your workspace.
"""

# Instructions

"""
Possible answers:
    
    1. Add the argument na_values=['NAN'] to pd.read_csv().
    
    2. Add the argument na_values=[0] to pd.read_csv().
    
    3. Add parse_dates=['Last Update'] to pd.read_csv().
    
    4. Both (1) and (3). {Answer}
"""

# solution
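
A minimal sketch of how the two arguments from the answer work together, assuming a hypothetical in-memory sample in place of nasdaq-listings.csv and pandas 2.0 or later (needed for the `date_format` argument). The sample rows and variable names are illustrative only.

```python
import io

import pandas as pd

# Hypothetical sample that imitates the layout of nasdaq-listings.csv
csv_text = """Stock Symbol,Last Sale,IPO Year,Last Update
AAPL,141.05,1980,4/26/17
GOOGL,840.18,NAN,4/24/17
GOOG,823.56,2004,4/23/17
"""

fixed = pd.read_csv(
    io.StringIO(csv_text),
    na_values="NAN",              # treat the string 'NAN' as a missing value
    parse_dates=["Last Update"],  # convert this column to datetime64
    date_format="%m/%d/%y",       # month/day/two-digit year, e.g. 4/26/17
)

print(fixed.dtypes)
print(fixed["Last Update"].iloc[0])  # 2017-04-26 00:00:00
```

With 'NAN' converted to a missing value, IPO Year becomes float64 (integers cannot hold NaN in a default NumPy-backed column), which matches the `1980.0` style seen in later output.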



#----------------------------------#

# Conclusion

"""

"""


In [15]:
# exercise 03

"""
Read data using .read_csv() with adequate parsing arguments

You have successfully identified the issues you must address when importing the given csv file.

In this exercise, you will once again load the NASDAQ data into a pandas DataFrame, but with a more robust function. pandas has been imported as pd.
"""

# Instructions

"""

    Read the file nasdaq-listings.csv into nasdaq with pd.read_csv(), adding the arguments na_values and parse_dates equal to the appropriate values. You should use 'NAN' for missing values, and parse dates in the Last Update column.
    Display and inspect the result using .head() and .info() to verify that the data has been imported correctly.

"""

# solution

# Import the data
# The file stores dates as month/day/two-digit year (e.g. 4/26/17), so date_format must match
nasdaq = pd.read_csv(path_data+'nasdaq-listings.csv', na_values='NAN', parse_dates=['Last Update'], date_format="%m/%d/%y")

# Display the head of the data
print(nasdaq.head())

# Inspect the data
nasdaq.info()

#----------------------------------#

# Conclusion

"""
Great job! The data looks just right.
"""

  Stock Symbol           Company Name  Last Sale  Market Capitalization  \
0         AAPL             Apple Inc.     141.05           7.400000e+11   
1        GOOGL          Alphabet Inc.     840.18           5.810000e+11   
2         GOOG          Alphabet Inc.     823.56           5.690000e+11   
3         MSFT  Microsoft Corporation      64.95           5.020000e+11   
4         AMZN       Amazon.com, Inc.     884.67           4.220000e+11   

   IPO Year             Sector  \
0    1980.0         Technology   
1       NaN         Technology   
2    2004.0         Technology   
3    1986.0         Technology   
4    1997.0  Consumer Services   

                                          Industry Last Update  
0                           Computer Manufacturing  2017-04-26  
1  Computer Software: Programming, Data Processing  2017-04-24  
2  Computer Software: Programming, Data Processing  2017-04-23  
3          Computer Software: Prepackaged Software  2017-04-26  
4                  


In [16]:
# exercise 04

"""
Load listing info from a single sheet

As you just learned, you can import data from a single sheet of an Excel file with the pd.read_excel() function by setting the optional sheet_name argument to either an integer giving the sheet's position or a string containing its name.

pandas.read_excel(file, sheet_name=0, na_values=None, ...)

Here, you will practice by importing NYSE data from a new file, listings.xlsx. pandas has been imported as pd.
"""

# Instructions

"""

    Read only the 'nyse' worksheet of 'listings.xlsx' where the symbol 'n/a' represents missing values. Assign the result to nyse.
    Display and inspect nyse with .head() and .info().

"""

# solution

# Import the data
nyse = pd.read_excel(path_data+'listings.xlsx', na_values='n/a', sheet_name='nyse')

# Display the head of the data
print(nyse.head())

# Inspect the data
nyse.info()

#----------------------------------#

# Conclusion

"""
Nice work! The nyse DataFrame contains 3147 entries.
"""

  Stock Symbol            Company Name  Last Sale  Market Capitalization  \
0          DDD  3D Systems Corporation      14.48           1.647165e+09   
1          MMM              3M Company     188.65           1.127366e+11   
2         WBAI         500.com Limited      13.96           5.793129e+08   
3         WUBA             58.com Inc.      36.11           5.225238e+09   
4          AHC   A.H. Belo Corporation       6.20           1.347351e+08   

   IPO Year             Sector  \
0       NaN         Technology   
1       NaN        Health Care   
2    2013.0  Consumer Services   
3    2013.0         Technology   
4       NaN  Consumer Services   

                                          Industry  
0          Computer Software: Prepackaged Software  
1                       Medical/Dental Instruments  
2            Services-Misc. Amusement & Recreation  
3  Computer Software: Programming, Data Processing  
4                             Newspapers/Magazines  
<class 'pandas.core.


In [19]:
# exercise 05

"""
Load listing data from two sheets

The import process is just as intuitive when using the sheet_names attribute of a pd.ExcelFile() object.

Passing a list as the sheet_name argument of pd.read_excel() returns a dictionary, whether the list comes from the sheet_names attribute of a pd.ExcelFile() object or is typed out by hand. In this dictionary, the keys are the sheet names and the values are the DataFrames holding the data from the corresponding sheets. You can extract a value from the dictionary by providing its key in brackets.

In this exercise, you will retrieve the list of stock exchanges from listings.xlsx and then use this list to read the data for all three exchanges into a dictionary. pandas has been imported as pd.
"""

# Instructions

"""

    Create a pd.ExcelFile() object using the file 'listings.xlsx' and assign to xls.
    Save the sheet_names attribute of xls as exchanges.
    Using exchanges to specify sheet names and 'n/a' to specify missing values in pd.read_excel(), read the data from all sheets in xls and assign the result to a dictionary, listings.
    Inspect only the 'nasdaq' data in this new dictionary with .info().

"""

# solution

# Create pd.ExcelFile() object
xls = pd.ExcelFile(path_data+'listings.xlsx')

# Extract sheet names and store in exchanges
exchanges = xls.sheet_names

# Create listings dictionary with all sheet data
listings = pd.read_excel(xls, sheet_name=exchanges, na_values='n/a')

# Inspect NASDAQ listings
listings['nasdaq'].info()


#----------------------------------#

# Conclusion

"""
Great job! The other sheet names are amex and nyse.
"""



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3167 entries, 0 to 3166
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Stock Symbol           3167 non-null   object 
 1   Company Name           3167 non-null   object 
 2   Last Sale              3165 non-null   float64
 3   Market Capitalization  3167 non-null   float64
 4   IPO Year               1386 non-null   float64
 5   Sector                 2767 non-null   object 
 6   Industry               2767 non-null   object 
dtypes: float64(3), object(4)
memory usage: 173.3+ KB
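
To make the dictionary structure concrete without needing an Excel file, the sketch below uses a hand-built dict as a stand-in for what pd.read_excel() returns when sheet_name is a list. The sheet names come from this exercise; the rows are placeholders and 'AMEX1' is a made-up symbol.

```python
import pandas as pd

# Stand-in for the result of pd.read_excel(xls, sheet_name=exchanges, ...):
# one DataFrame per sheet, keyed by sheet name (rows are placeholders)
listings = {
    "amex": pd.DataFrame({"Stock Symbol": ["AMEX1"]}),
    "nasdaq": pd.DataFrame({"Stock Symbol": ["AAPL", "GOOGL", "GOOG"]}),
    "nyse": pd.DataFrame({"Stock Symbol": ["DDD", "MMM"]}),
}

# The keys are the sheet names
sheet_names = list(listings)

# A single sheet is retrieved by its key, like any dictionary value
nasdaq_sheet = listings["nasdaq"]

# Iterating over key-value pairs gives access to every sheet in turn
sizes = {name: len(frame) for name, frame in listings.items()}

print(sheet_names)  # ['amex', 'nasdaq', 'nyse']
print(sizes)        # {'amex': 1, 'nasdaq': 3, 'nyse': 2}
```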



In [20]:
# exercise 06

"""
Load all listing data and iterate over key-value dictionary pairs

You already know that a pd.DataFrame() object is a two-dimensional labeled data structure. As you saw in the video, the pd.concat() function is used to concatenate, or vertically combine, two or more DataFrames. You can also use broadcasting to add new columns to DataFrames.

In this exercise, you will practice using this new pandas function with the data from the NYSE and NASDAQ exchanges. pandas has been imported as pd.
"""

# Instructions

"""

    Import data in listings.xlsx from sheets 'nyse' and 'nasdaq' into the variables nyse and nasdaq. Read 'n/a' to represent missing values.
    Inspect the contents of both DataFrames with .info() to find out how many companies are reported.
    With broadcasting, create a new reference column called 'Exchange' holding the values 'NYSE' or 'NASDAQ' for each DataFrame.
    Use pd.concat() to concatenate the nyse and nasdaq DataFrames, in that order, and assign to combined_listings.

"""

# solution

# Import the NYSE and NASDAQ listings
nyse = pd.read_excel(path_data+'listings.xlsx', sheet_name='nyse', na_values='n/a')
nasdaq = pd.read_excel(path_data+'listings.xlsx', sheet_name='nasdaq', na_values='n/a')

# Inspect nyse and nasdaq
nyse.info()
nasdaq.info()

# Add Exchange reference columns
nyse['Exchange'] = 'NYSE'
nasdaq['Exchange'] = 'NASDAQ'

# Concatenate DataFrames  
combined_listings = pd.concat([nyse, nasdaq]) 

#----------------------------------#

# Conclusion

"""
Great job. Note that concatenating nasdaq and nyse, in that order, would have produced a different result: the same rows, but with the NASDAQ listings first.
"""

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3147 entries, 0 to 3146
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Stock Symbol           3147 non-null   object 
 1   Company Name           3147 non-null   object 
 2   Last Sale              3079 non-null   float64
 3   Market Capitalization  3147 non-null   float64
 4   IPO Year               1361 non-null   float64
 5   Sector                 2177 non-null   object 
 6   Industry               2177 non-null   object 
dtypes: float64(3), object(4)
memory usage: 172.2+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3167 entries, 0 to 3166
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Stock Symbol           3167 non-null   object 
 1   Company Name           3167 non-null   object 
 2   Last Sale              3165 non-null   float64
 3   M
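
The broadcasting and concatenation steps can be tried on a small scale. This is a minimal sketch with two hypothetical sample frames (`nyse_sample` and `nasdaq_sample` are illustrative names, and the rows are a handful of symbols from the output above). It also shows that the order of the list passed to pd.concat() decides the row order, and that the original row labels are kept unless `ignore_index=True` is passed.

```python
import pandas as pd

# Small stand-ins for the two exchange DataFrames
nyse_sample = pd.DataFrame({"Stock Symbol": ["DDD", "MMM"]})
nasdaq_sample = pd.DataFrame({"Stock Symbol": ["AAPL", "GOOGL", "GOOG"]})

# Broadcasting: a single scalar is repeated for every row of the new column
nyse_sample["Exchange"] = "NYSE"
nasdaq_sample["Exchange"] = "NASDAQ"

# Rows are stacked in the order the frames appear in the list
combined = pd.concat([nyse_sample, nasdaq_sample])
reversed_order = pd.concat([nasdaq_sample, nyse_sample])

# Each frame keeps its own row labels, so labels repeat after concatenation
print(combined.index.tolist())                # [0, 1, 0, 1, 2]
print(combined["Exchange"].tolist())          # ['NYSE', 'NYSE', 'NASDAQ', 'NASDAQ', 'NASDAQ']
print(reversed_order["Exchange"].iloc[0])     # NASDAQ

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([nyse_sample, nasdaq_sample], ignore_index=True)
print(renumbered.index.tolist())              # [0, 1, 2, 3, 4]
```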


In [22]:
# exercise 07

"""
How many companies are listed on the NYSE and NASDAQ?

Before moving on, let's step back to examine the size of our data. Based on the previous exercise, how many companies are listed altogether on the NYSE and NASDAQ exchanges?

The nyse, nasdaq, and combined_listings DataFrames have been loaded into your workspace.
"""

# Instructions

"""
Possible answers
    
    3,167
    
    3,147
    
    6,314 {Answer}
    
    None of the above
"""

# solution

combined_listings.info()

#----------------------------------#

# Conclusion

"""
Correct! According to combined_listings.info(), there are 6314 entries total.
"""

<class 'pandas.core.frame.DataFrame'>
Index: 6314 entries, 0 to 3166
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Stock Symbol           6314 non-null   object 
 1   Company Name           6314 non-null   object 
 2   Last Sale              6244 non-null   float64
 3   Market Capitalization  6314 non-null   float64
 4   IPO Year               2747 non-null   float64
 5   Sector                 4944 non-null   object 
 6   Industry               4944 non-null   object 
 7   Exchange               6314 non-null   object 
dtypes: float64(3), object(5)
memory usage: 444.0+ KB
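
The row count and the index range reported above ("6314 entries, 0 to 3166") can be checked with placeholder frames of the same sizes. This sketch assumes nothing about the real data beyond the two row counts shown by .info(); the frame and column names are illustrative.

```python
import pandas as pd

# Row counts reported by nyse.info() and nasdaq.info()
nyse_rows, nasdaq_rows = 3147, 3167

# Placeholder frames with the same number of rows as the real ones
first = pd.DataFrame({"row": range(nyse_rows)})
second = pd.DataFrame({"row": range(nasdaq_rows)})

stacked = pd.concat([first, second])

print(len(stacked))             # 6314
print(stacked.index.max())      # 3166
print(stacked.index.is_unique)  # False
```

The largest label is 3166 rather than 6313 because each frame keeps its own labels, which is why .info() shows a plain Index instead of a RangeIndex.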



In [23]:
# exercise 08

"""
Automate the loading and combining of data from multiple Excel worksheets

You are now ready to automate the import process of listing information from all three exchanges in the Excel file listings.xlsx by implementing a for loop. Let's look at what you'll do:

    Retrieve the sheet names of a pd.ExcelFile() object using its sheet_names attribute.
    Create an empty list.
    Write a for loop that iterates through these sheet names to read the data from the corresponding sheet name in the Excel file into a variable. Add a reference column, if desired. Append the contents of this variable to the list with each iteration.
    Concatenate the DataFrames in the list.

As always, refer to the previous exercises in this chapter or the pandas documentation if you need any help. pandas has been imported as pd.
"""

# Instructions

"""

    Create the pd.ExcelFile() object using the file listings.xlsx and assign to the variable xls.
    Retrieve the sheet names from the .sheet_names attribute of xls and assign to exchanges.
    Create an empty list and assign to the variable listings.
    Iterate over exchanges using a for loop with exchange as iterator variable. In each iteration:
        Use pd.read_excel() with xls as the data source, exchange as the sheet_name argument, and 'n/a' as na_values to address missing values. Assign the result to listing.
        Create a new column in listing called 'Exchange' with the value exchange (the iterator variable).
        Append the resulting listing DataFrame to listings.
    Use pd.concat() to concatenate the contents of listings and assign to listing_data.
    Inspect the contents of listing_data using .info().

"""

# solution

# Create the pd.ExcelFile() object
xls = pd.ExcelFile(path_data+'listings.xlsx')

# Extract the sheet names from xls
exchanges = xls.sheet_names

# Create an empty list: listings
listings = []

# Import the data
for exchange in exchanges:
    listing = pd.read_excel(xls, sheet_name=exchange, na_values='n/a')
    listing['Exchange'] = exchange
    listings.append(listing)

# Concatenate the listings: listing_data
listing_data = pd.concat(listings)

# Inspect the results
listing_data.info()

#----------------------------------#

# Conclusion

"""
As you would expect, listing_data has 6674 entries, which is the sum of the number of rows in the 'nasdaq', 'nyse', and 'amex' worksheets.
"""

<class 'pandas.core.frame.DataFrame'>
Index: 6674 entries, 0 to 3146
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Stock Symbol           6674 non-null   object 
 1   Company Name           6674 non-null   object 
 2   Last Sale              6590 non-null   float64
 3   Market Capitalization  6674 non-null   float64
 4   IPO Year               2852 non-null   float64
 5   Sector                 5182 non-null   object 
 6   Industry               5182 non-null   object 
 7   Exchange               6674 non-null   object 
dtypes: float64(3), object(5)
memory usage: 469.3+ KB
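
The same loop pattern can be exercised without an Excel file. The sketch below is an illustration under stated assumptions: the `sheets` dict of CSV text is a hypothetical stand-in for the worksheets, pd.read_csv() replaces pd.read_excel(), and the 'AMEX1' and 'AMEX2' symbols and their prices are made up.

```python
import io

import pandas as pd

# Hypothetical stand-in for the worksheets: CSV text keyed by sheet name
sheets = {
    "amex": "Stock Symbol,Last Sale\nAMEX1,n/a\nAMEX2,1.25\n",
    "nasdaq": "Stock Symbol,Last Sale\nAAPL,141.05\nGOOGL,840.18\n",
    "nyse": "Stock Symbol,Last Sale\nDDD,14.48\n",
}

frames = []
for exchange, text in sheets.items():
    # Read one "sheet", treating 'n/a' as a missing value
    frame = pd.read_csv(io.StringIO(text), na_values="n/a")
    # Record which sheet each row came from
    frame["Exchange"] = exchange
    frames.append(frame)

# Stack all frames; ignore_index=True gives the result unique row labels
listing_data = pd.concat(frames, ignore_index=True)

rows_per_exchange = listing_data["Exchange"].value_counts().to_dict()
print(len(listing_data))  # 5
print(rows_per_exchange)
```

Collecting the frames in a list and calling pd.concat() once at the end avoids repeatedly copying data, which would happen if the frames were concatenated inside the loop.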

