# Stock Ticker Symbol Data cleaning

The ticker symbol data is downloaded from [NASDAQ.com](http://www.nasdaq.com/screening/company-list.aspx). For this project we would like to select stocks with marketcapital larger than 10 billion.

In [1]:
import requests
import pandas as pd
import numpy as np
import time
import feather
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Read data 

In [2]:
# read ticker symbols from all US stocks, data source (http://www.nasdaq.com/screening/company-list.aspx)
tickers_NASDAQ = pd.read_csv("../data/data_Ticker/NASDAQ.csv", skiprows=None)
tickers_NASDAQ ["Exchange"] = "NASDAQ"
tickers_NASDAQ = tickers_NASDAQ.drop("Unnamed: 8", axis=1)

tickers_NYSE = pd.read_csv("../data/data_Ticker/NYSE.csv", skiprows=None)
tickers_NYSE ["Exchange"] = "NYSE"
tickers_NYSE = tickers_NYSE.drop("Unnamed: 8", axis=1)

tickers_AMEX = pd.read_csv("../data/data_Ticker/AMEX.csv", skiprows=None)
tickers_AMEX ["Exchange"] = "AMEX"
tickers_AMEX = tickers_AMEX.drop("Unnamed: 8", axis=1)

# combine all symbols from three exchanges into one dataframe
tickers = tickers_NASDAQ.append(tickers_NYSE)
tickers = tickers.append(tickers)

In [3]:
tickers.head(5)

Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Exchange
0,PIH,"1347 Property Insurance Holdings, Inc.",7.4,$44.29M,2014.0,Finance,Property-Casualty Insurers,http://www.nasdaq.com/symbol/pih,NASDAQ
1,TURN,180 Degree Capital Corp.,2.03,$63.18M,,Finance,Finance/Investors Services,http://www.nasdaq.com/symbol/turn,NASDAQ
2,FLWS,"1-800 FLOWERS.COM, Inc.",9.5,$613.93M,1999.0,Consumer Services,Other Specialty Stores,http://www.nasdaq.com/symbol/flws,NASDAQ
3,FCCY,1st Constitution Bancorp (NJ),17.95,$144.91M,,Finance,Savings Institutions,http://www.nasdaq.com/symbol/fccy,NASDAQ
4,SRCE,1st Source Corporation,49.88,$1.29B,,Finance,Major Banks,http://www.nasdaq.com/symbol/srce,NASDAQ


### Clean data
There are many duplicated symbols, uncommon symbols, and one symbol with multiple classes of stocks. For example, all symbols for "Bank of America Corporation" is shown below.

In [4]:
tickers.loc[tickers.Name == "Bank of America Corporation"]

Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Exchange
320,BAC,Bank of America Corporation,26.62,$277.66B,,Finance,Major Banks,http://www.nasdaq.com/symbol/bac,NYSE
321,BAC.WS.A,Bank of America Corporation,,,,,,http://www.nasdaq.com/symbol/bac.ws.a,NYSE
322,BAC.WS.B,Bank of America Corporation,,,,,,http://www.nasdaq.com/symbol/bac.ws.b,NYSE
323,BAC^A,Bank of America Corporation,26.94,,,,,http://www.nasdaq.com/symbol/bac^a,NYSE
324,BAC^C,Bank of America Corporation,27.28,,,,,http://www.nasdaq.com/symbol/bac^c,NYSE
325,BAC^D,Bank of America Corporation,25.98,,,,,http://www.nasdaq.com/symbol/bac^d,NYSE
326,BAC^E,Bank of America Corporation,23.73,,,,,http://www.nasdaq.com/symbol/bac^e,NYSE
327,BAC^I,Bank of America Corporation,26.61,,,,,http://www.nasdaq.com/symbol/bac^i,NYSE
328,BAC^L,Bank of America Corporation,1316.0,,,,,http://www.nasdaq.com/symbol/bac^l,NYSE
329,BAC^W,Bank of America Corporation,26.81,,,Finance,Major Banks,http://www.nasdaq.com/symbol/bac^w,NYSE


__Data cleaning:__
    
* remove symbols with `^` 
* reomve symbols without stock price `LastSale == n/a`
* reomve symbols without market captical `MarketCap == n/a`

In [5]:
# remove stocks with ^ signs
data_size = tickers.shape
data_size = np.arange(data_size[0])   # give a integer vector for indexing in tickers.
tickers = tickers.reset_index()     # reset index after joining multiple dataframes
tickers.drop(list(data_size[tickers.Symbol.str.contains("\^").tolist()]),axis = 0, inplace=True)
tickers = tickers.reset_index()    # reset index after deleting some rows

# drop colums created when reseting index.
tickers.drop(["level_0","index"], axis=1, inplace=True)  

# reomve symbols without market captical `MarketCap == n/a`
tmp = tickers.shape
tmp = np.arange(tmp[0])
tickers.drop(list(tmp[tickers.MarketCap=="n/a"]),axis = 0, inplace=True)
tickers = tickers.reset_index()

# delete duplicated names
tickers.drop_duplicates(["Name"], inplace=True)

# delete rows with Sector == n/a
tmp = tickers.shape
tmp = np.arange(tmp[0])
tickers = tickers.reset_index()
tickers.drop(["level_0","index"], axis=1, inplace=True)
tickers.drop(list(tmp[tickers.Sector=="n/a"]),axis = 0, inplace=True)
tickers = tickers.reset_index()

# drop colums created when reseting index.
tickers.drop(["index"], axis=1, inplace=True)
tickers.head(5)

tickers.describe()

Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Exchange
0,PIH,"1347 Property Insurance Holdings, Inc.",7.4,$44.29M,2014.0,Finance,Property-Casualty Insurers,http://www.nasdaq.com/symbol/pih,NASDAQ
1,TURN,180 Degree Capital Corp.,2.03,$63.18M,,Finance,Finance/Investors Services,http://www.nasdaq.com/symbol/turn,NASDAQ
2,FLWS,"1-800 FLOWERS.COM, Inc.",9.5,$613.93M,1999.0,Consumer Services,Other Specialty Stores,http://www.nasdaq.com/symbol/flws,NASDAQ
3,FCCY,1st Constitution Bancorp (NJ),17.95,$144.91M,,Finance,Savings Institutions,http://www.nasdaq.com/symbol/fccy,NASDAQ
4,SRCE,1st Source Corporation,49.88,$1.29B,,Finance,Major Banks,http://www.nasdaq.com/symbol/srce,NASDAQ


Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Exchange
count,4508,4508,4508.0,4508,4508.0,4508,4508,4508,4508
unique,4508,4508,3225.0,3442,41.0,12,136,4508,2
top,CORT,"YogaWorks, Inc.",3.75,$1.37B,,Finance,Major Pharmaceuticals,http://www.nasdaq.com/symbol/ts,NASDAQ
freq,1,1,8.0,13,2312.0,795,383,1,2529


In [6]:
tickers.loc[tickers.Name == "Bank of America Corporation"]

Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Exchange
2736,BAC,Bank of America Corporation,26.62,$277.66B,,Finance,Major Banks,http://www.nasdaq.com/symbol/bac,NYSE


### Select tickers with only market cap larger than 10 Billion

Here we only select stocks with market cap that is larger than 10 billion in 2017-11. To do that, we can first select rows in which `MarketCap` column contains `B` (billion). Then covert the type to numeric and selecting values larger than 10.

In [7]:
# select ticker symbols with `MarketCap` contains "B"
tickers_Billion = tickers.loc[tickers.MarketCap.str.contains("B")]
tickers_Billion.describe()

# see the filtered symbols
tickers_Billion = tickers_Billion.reset_index();
tickers_Billion.drop(["index"], axis=1, inplace=True)
tickers_Billion.head(5)

Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Exchange
count,2253,2253,2253.0,2253,2253.0,2253,2253,2253,2253
unique,2253,2253,2011.0,1226,41.0,12,129,2253,2
top,XON,Acco Brands Corporation,44.75,$1.37B,,Consumer Services,Real Estate Investment Trusts,http://www.nasdaq.com/symbol/eqm,NYSE
freq,1,1,4.0,13,1191.0,394,142,1,1465


Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Exchange
0,SRCE,1st Source Corporation,49.88,$1.29B,,Finance,Major Banks,http://www.nasdaq.com/symbol/srce,NASDAQ
1,TWOU,"2U, Inc.",63.67,$3.32B,2014.0,Technology,Computer Software: Prepackaged Software,http://www.nasdaq.com/symbol/twou,NASDAQ
2,JOBS,"51job, Inc.",62.57,$3.87B,2004.0,Technology,Diversified Commercial Services,http://www.nasdaq.com/symbol/jobs,NASDAQ
3,CAFD,8point3 Energy Partners LP,14.94,$1.18B,2015.0,Public Utilities,Electric Utilities: Central,http://www.nasdaq.com/symbol/cafd,NASDAQ
4,EGHT,8x8 Inc,14.05,$1.29B,,Public Utilities,Telecommunications Equipment,http://www.nasdaq.com/symbol/eght,NASDAQ


In [8]:
# convert last sale price type to numeric
tickers_Billion.loc[:,"LastSale"] = tickers_Billion.loc[:,"LastSale"].astype(float);

# remove $ and B in `MarketCap`
tmp = tickers_Billion.MarketCap.str.replace("$", "")
tmp = tmp.str.replace("B", "")
#tmp.reset_index()

# change marketcap to real numercial values.
tickers_Billion.MarketCap= tmp.astype(float) * 1000000000

tickers_Billion.head()

Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Exchange
0,SRCE,1st Source Corporation,49.88,1290000000.0,,Finance,Major Banks,http://www.nasdaq.com/symbol/srce,NASDAQ
1,TWOU,"2U, Inc.",63.67,3320000000.0,2014.0,Technology,Computer Software: Prepackaged Software,http://www.nasdaq.com/symbol/twou,NASDAQ
2,JOBS,"51job, Inc.",62.57,3870000000.0,2004.0,Technology,Diversified Commercial Services,http://www.nasdaq.com/symbol/jobs,NASDAQ
3,CAFD,8point3 Energy Partners LP,14.94,1180000000.0,2015.0,Public Utilities,Electric Utilities: Central,http://www.nasdaq.com/symbol/cafd,NASDAQ
4,EGHT,8x8 Inc,14.05,1290000000.0,,Public Utilities,Telecommunications Equipment,http://www.nasdaq.com/symbol/eght,NASDAQ


__Select Market Cap larger than 10 billions__

In [9]:
tickers_TenBillion = tickers_Billion.loc[tickers_Billion.MarketCap > 10000000000]

tickers_TenBillion = tickers_TenBillion.reset_index()
tickers_TenBillion.drop(["index"], axis=1, inplace=True)

tickers_TenBillion.head()
tickers_TenBillion.shape

Unnamed: 0,Symbol,Name,LastSale,MarketCap,IPOyear,Sector,industry,Summary Quote,Exchange
0,ATVI,"Activision Blizzard, Inc",64.1,48470000000.0,,Technology,Computer Software: Prepackaged Software,http://www.nasdaq.com/symbol/atvi,NASDAQ
1,ADBE,Adobe Systems Incorporated,182.24,89830000000.0,1986.0,Technology,Computer Software: Prepackaged Software,http://www.nasdaq.com/symbol/adbe,NASDAQ
2,AMD,"Advanced Micro Devices, Inc.",11.38,10980000000.0,,Technology,Semiconductors,http://www.nasdaq.com/symbol/amd,NASDAQ
3,ALXN,"Alexion Pharmaceuticals, Inc.",110.87,24770000000.0,1996.0,Health Care,Major Pharmaceuticals,http://www.nasdaq.com/symbol/alxn,NASDAQ
4,ALGN,"Align Technology, Inc.",253.89,20360000000.0,2001.0,Health Care,Industrial Specialties,http://www.nasdaq.com/symbol/algn,NASDAQ


(666, 9)

#### Save the ticker symbols dataframe to feature (for read in R)

In [11]:
# Write to a csv file
tickers_TenBillion.to_csv("../data/data_Ticker/tickers_TenBillion.csv")

# Write to a feather file for R
feather.write_dataframe(tickers_TenBillion, "../data/data_Ticker/tikers_TenBillion.feather")