### Exploring Data in Python Part 1

#### Exploring your working directory

In [None]:
# Open a file: file
file = open('moby_dick.txt', mode='r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

In [None]:
# Read and print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

### Flat Files
* Flat files are basic text files containing records, that is, table data, without structured relationships. The file extension for flat files can be .csv for comma-separated values or .txt for text files, and values in flat files can be separated by delimiters other than commas, such as tabs.
* It is important to know whether a flat file has a header, as it may affect data import. This section also covers how to import flat files using NumPy or pandas and provides examples of tab-delimited files containing numeric or string data.
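To make the header and delimiter points concrete, here is a minimal sketch using only the standard library. The file contents are hypothetical and held in memory rather than read from disk.

```python
import csv
import io

# Hypothetical tab-delimited flat file with a header row (not from the course data)
raw = "name\tage\nAnn\t31\nBob\t45\n"

# csv.reader splits each line on the chosen delimiter
reader = csv.reader(io.StringIO(raw), delimiter="\t")

header = next(reader)    # the first row holds the column names
records = list(reader)   # the remaining rows are the records, read as strings

print(header)
print(records)
```

Every value comes back as a string, which is one reason NumPy or pandas is usually preferred for numeric data.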

### Importing flat files using NumPy
* This article discusses the use of NumPy, a Python package, to import flat files as NumPy arrays, which are efficient and essential for other packages like scikit-learn. It explains the use of NumPy functions like loadtxt and genfromtxt to import data, and also covers customizations that can be made, such as setting the delimiter, skipping rows, and selecting columns.
* The article also mentions the difficulty of importing mixed datatypes and suggests using other functions to handle them.
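The customizations above can be sketched on in-memory data. The strings below are hypothetical stand-ins for real files, and `genfromtxt` is shown as one possible way to handle mixed datatypes.

```python
import io
import numpy as np

# Hypothetical tab-delimited data with a header row and an unwanted third column
raw = "time\tvalue\textra\n0.0\t1.5\t9\n1.0\t2.5\t9\n2.0\t3.5\t9\n"

# skiprows=1 drops the header; usecols keeps only the first two columns
arr = np.loadtxt(io.StringIO(raw), delimiter="\t", skiprows=1, usecols=[0, 1])
print(arr.shape)

# Hypothetical comma-separated data mixing strings and integers
mixed = "name,age\nAnn,31\nBob,45\n"

# names=True reads field names from the header; dtype=None infers a type per column
rec = np.genfromtxt(io.StringIO(mixed), delimiter=",", names=True,
                    dtype=None, encoding="utf-8")
print(rec["name"], rec["age"])
```

The result of `genfromtxt` here is a structured array, so columns are selected by field name rather than by position.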

In [None]:
# Import packages
import numpy as np
import matplotlib.pyplot as plt

# Assign filename to variable: file
file = 'digits.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')

# Print datatype of digits
print(type(digits))

# Select a row (skipping the first column) and reshape it into a 28x28 image
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot reshaped data
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

In [None]:
# What if there are rows, such as a header, that you don't want to import? What if your file has a delimiter other than a comma? What if you only wish to import particular columns?

# Assign filename: file
file = 'digits_header.txt'

# Load the data, skipping the header row and keeping only columns 0 and 2: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2])

# print data
print(data)

##### Importing different datatypes

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Assign filename: file
file = 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# print the first element of data
print(data[0])

In [None]:
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

In [None]:
print(data_float[9])

In [None]:
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()

##### Working with mixed datatypes

In [None]:
# Assign a filename: file
file = 'titanic_sub.csv'

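# import file using np.genfromtxt (np.recfromcsv was removed from the main NumPy namespace in 2.0): d
d = np.genfromtxt(file, delimiter=',', names=True, dtype=None)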

# print out first three entries of d
print(d[:3])

### Importing flat files using pandas
* This section discusses the need for a two-dimensional labeled data structure with columns of potentially different types that can be easily manipulated, sliced, reshaped, grouped, joined, merged, and analyzed in a missing-value-friendly manner, a need fulfilled by the pandas DataFrame.
* It explains that pandas is a Python library that fills the gap between data preparation and data analysis and modeling, and that the DataFrame is the pandas data structure most relevant to the data manipulation and analysis workflow.
* This section also covers converting a DataFrame to a NumPy array.
* Finally, it explains that using pandas to import flat files as DataFrames is standard and best practice in data science.
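As a small illustration of the points above, the sketch below reads a hypothetical tab-delimited string into a DataFrame, treating one marker as a missing value, and then converts the result to a NumPy array. The column names and the 'Nothing' marker are assumptions chosen for the example.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical tab-delimited data with a comment line and a missing-value marker
raw = "Name\tAge\n# a comment line that should be skipped\nAnn\t31\nBob\tNothing\n"

# sep sets the delimiter, comment drops commented lines, na_values marks missing data
df = pd.read_csv(io.StringIO(raw), sep="\t", comment="#", na_values=["Nothing"])
print(df)

# Convert the DataFrame to a NumPy array
arr = df.to_numpy()
print(type(arr), arr.shape)
```

Because 'Nothing' is read as NaN, the Age column stays numeric and missing-value-aware methods such as `isna()` work on it directly.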

In [None]:
import pandas as pd

# Assign the filename: file
file = 'titanic_sub.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)

# View the head of the DataFrame
print(df.head())

##### Using pandas to import flat files as DataFrames (2)

In [None]:
# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = data.values

# Print the datatype of data array to the shell
print(type(data_array))

##### Customizing your pandas import

In [None]:
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values=['Nothing'])

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

### Final Thoughts on Import
* This section emphasizes the importance of using pandas to import data, although it's useful to be aware of other methods as well. The next chapter will cover importing various file types using pandas, and the author mentions the constant development of new file formats and import methods. The sequel to this course will cover scraping data from the web and interacting with APIs.

### Not So Flat Anymore
##### Loading a pickled file
* There are a number of datatypes that cannot be saved easily to flat files, such as lists and dictionaries. If you want your files to be human readable, you may want to save them as text files in a clever manner. JSONs, which you will see in a later chapter, are appropriate for Python dictionaries. If you only need to load the object back into Python, you can instead serialize it to a binary pickle file, which is what the cell below reads.
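The trade-off between pickling and a human-readable format can be sketched with a round trip. The dictionary and the temporary file path below are hypothetical and exist only for this example.

```python
import json
import os
import pickle
import tempfile

# Hypothetical dictionary standing in for the contents of a pickled file
original = {"airline": "8", "aug": 85, "june": "69.4"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl")

    # Pickle files are binary, so open them in 'wb' / 'rb' mode
    with open(path, "wb") as f:
        pickle.dump(original, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# JSON is the human-readable alternative for a dictionary like this one
as_json = json.dumps(original)

print(restored == original)
print(as_json)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.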

In [None]:
# Import pickle package
import pickle

# Open pickle file and load data
with open('data.pkl', 'rb') as file:
    d = pickle.load(file)

# print data
print(d)

# print datatype
print(type(d))

##### Listing sheets in Excel Files
* Whether you like it or not, any working data scientist will need to deal with Excel spreadsheets at some point in time. You won't always want to do so in Excel.

In [None]:
# import pandas
import pandas as pd

# Assign spreadsheet filename: file
file = 'battledeath.xlsx'

# Load spreadsheet: xls
xls = pd.ExcelFile(file)

# print sheet name
print(xls.sheet_names)

##### Importing sheets from Excel Files
* In the previous exercise, you saw that the Excel file contains two sheets, '2002' and '2004'. The next step is to import these.

In [None]:
# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')

# Print the head of the DataFrame df1
print(df1.head())

# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)

# Print the head of the DataFrame df2
print(df2.head())

##### Customizing your spreadsheet import
* Here you'll parse your spreadsheet and use additional arguments to skip rows, rename columns, and select only particular columns.

In [None]:
# Parse the first sheet and rename the columns: df1
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])

# Print the head of the dataframe df1
print(df1.head())

# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])

# Print the head of the DataFrame df2
print(df2.head())

### Importing SAS/Stata files using pandas
* This section imports SAS files (.sas7bdat) and Stata files (.dta) as pandas DataFrames, using the sas7bdat package for the former and the pandas function read_stata for the latter.

In [None]:
from sas7bdat import SAS7BDAT
import matplotlib.pyplot as plt

# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

# Print head of DataFrame
print(df_sas.head())

# Plot histograms of a DataFrame feature (Pandas and pyplot already imported)
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()

### Using read_stata to import Stata files
* The pandas package has been imported in the environment as pd and the file disarea.dta is in your working directory. The data consist of disease extents for several diseases in various countries.

##### Importing Stata Files
* Here, you'll gain expertise in importing Stata files as DataFrames using the pd.read_stata() function from pandas.

In [None]:
# Import pandas
import pandas as pd

# Load Stata file into pandas DataFrame: df
df = pd.read_stata('disarea.dta')

# print the head of the DataFrame: df
print(df.head())

In [None]:
# Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of Disease')
plt.ylabel('Number of Countries')
plt.show()

##### Using h5py.File to import HDF5 files

In [None]:
# import packages
import numpy as np
import h5py

# Assign filename: file
# file = 'LIGO_data.hdf5'  # alternative filename, unused here
file = 'L-L1_LOSC_4_V1-1126259446-32.hdf5'

# Load the data
data = h5py.File(file, 'r')

# Print the datatype of the loaded file
print(type(data))

# Print the keys of the file
for key in data.keys():
    print(key)

##### Extracting data from your HDF5 file

In [None]:
# Get the HDF5 group: group
group = data['strain']

# Check out the keys of group
for key in group.keys():
    print(key)

# Set variable equal to time series data: strain
strain = np.array(data['strain']['Strain'])

# Set number of time points to sample:
num_samples = 10000

# Set time vector
time = np.arange(0, 1, 1/num_samples)

# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()

### Importing MATLAB files
* The focus here is on importing MATLAB files (.mat) into Python. MATLAB is a numerical computing environment widely used in the fields of engineering and science, and its native file format is .mat. The scipy library provides the loadmat and savemat functions (in scipy.io) to read and write .mat files in Python.
* A .mat file contains a collection of MATLAB objects, such as strings, floats, vectors, and arrays, stored in a MATLAB workspace. When a .mat file is imported into Python using loadmat, it results in a dictionary where the keys are the variable names in the MATLAB workspace, and the values are the corresponding objects assigned to those variables.
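The dictionary structure described above can be seen with a round trip through an in-memory buffer. The variable name `signal` and its values are hypothetical, chosen only for this sketch.

```python
import io
import numpy as np
import scipy.io

# Write a hypothetical MATLAB workspace with one variable to an in-memory buffer
buffer = io.BytesIO()
scipy.io.savemat(buffer, {"signal": np.arange(6).reshape(2, 3)})
buffer.seek(0)

# loadmat returns a dictionary keyed by MATLAB variable name
mat = scipy.io.loadmat(buffer)

# Keys wrapped in double underscores are file metadata, not workspace variables
user_keys = [key for key in mat if not key.startswith("__")]

print(type(mat))
print(user_keys)
print(mat["signal"].shape)
```

Filtering out the metadata keys leaves just the variables that were saved from the workspace.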

##### Loading .mat files

In [None]:
# import package
import scipy.io

# Load MATLAB file: mat
mat = scipy.io.loadmat('ja_data2.mat')

# Print the datatype of mat
print(type(mat))

##### The structure of .mat in Python
* Here you'll explore the structure of the dictionary that you loaded from the MATLAB file.

In [None]:
# Print the keys of the MATLAB dictionary
print(mat.keys())

# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))

# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))

# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()



### Creating a database engine in Python
* This article discusses creating a database engine in Python using SQLAlchemy to connect to an SQLite database, and getting the table names from the database. An SQL engine is needed to communicate queries to the database, and the create_engine function creates one. The table_names method can then be applied to the engine object to obtain a list of table names, which can be printed to the console. Note that this method has been deprecated since SQLAlchemy 1.4 and was removed in 2.0, where inspect(engine).get_table_names() takes its place.
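A minimal sketch of the same idea, assuming an in-memory SQLite database in place of Chinook.sqlite so that it runs without any file. The table definition is hypothetical.

```python
from sqlalchemy import create_engine, inspect, text

# 'sqlite://' with no path creates an in-memory database (an assumption for this sketch)
engine = create_engine("sqlite://")

# Create one hypothetical table so there is something to list
with engine.begin() as con:
    con.execute(text("CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT)"))

# The inspector reports table names in both SQLAlchemy 1.4 and 2.x
table_names = inspect(engine).get_table_names()
print(table_names)
```

An empty list from a file-based engine usually means the path in the connection string does not point at the intended database file, because SQLite creates a new empty database when the file is missing.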

In [1]:
# import module
from sqlalchemy import create_engine, inspect

# Create engine:
engine = create_engine('sqlite:///Chinook.sqlite')

In [2]:
# Save the table names to a list: table_names
# (engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in 1.4 and 2.x)
table_names = inspect(engine).get_table_names()

# Print the table names to the shell
print(table_names)


### Querying Relational Databases in Python
* This section covers how to query relational databases in Python using SQL, SQLAlchemy, and pandas. The basic SQL query "SELECT * FROM Table_Name" is introduced, and the workflow of SQL querying is explained. The steps include importing packages and functions, creating an engine, connecting to the database, querying the database, saving results to a DataFrame, and closing the connection. The text also explains how to set the DataFrame's column names and how to use a context manager to open a connection.
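The workflow steps above can be sketched end to end. This version assumes an in-memory SQLite database and a hypothetical two-row Album table instead of the Chinook file.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Create an engine for an in-memory database (assumption: no Chinook file is needed)
engine = create_engine("sqlite://")

# Set up a small hypothetical table to query
with engine.begin() as con:
    con.execute(text("CREATE TABLE Album (AlbumId INTEGER, Title TEXT)"))
    con.execute(text("INSERT INTO Album VALUES (1, 'First'), (2, 'Second')"))

# Connect, query, and save the results; the context manager closes the connection
with engine.connect() as con:
    rs = con.execute(text("SELECT * FROM Album"))
    df = pd.DataFrame(rs.fetchall(), columns=list(rs.keys()))

print(df)
```

Wrapping the SQL string in `text()` keeps the call valid on SQLAlchemy 2.x, which no longer accepts bare strings in `execute`.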

In [3]:
# Import packages
from sqlalchemy import create_engine, inspect, text
import pandas as pd

# Create engine:
engine = create_engine('sqlite:///Chinook.sqlite')

# Print the table names
table_names = inspect(engine).get_table_names()
print(table_names)

# Open engine connection
con = engine.connect()

# Perform query: rs (text() wraps the raw SQL string, as SQLAlchemy 2.x requires)
rs = con.execute(text('SELECT * FROM Album'))

# Save results of the query to a DataFrame: df
df = pd.DataFrame(rs.fetchall())

# Close connection
con.close()

# Print head of DataFrame
print(df.head())


In [4]:
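# Import text() to wrap raw SQL strings, as SQLAlchemy 2.x requires
from sqlalchemy import text

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute(text('SELECT LastName, Title FROM Employee'))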
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()

# Print the length of the dataframe df
print(len(df))

# Print the head of the dataframe df
print(df.head())



##### Filtering your database records using SQL's WHERE clause

In [5]:
# Import text() to wrap raw SQL strings, as SQLAlchemy 2.x requires
from sqlalchemy import text

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute(text('SELECT * FROM Employee WHERE EmployeeID >= 6'))
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print head
print(df.head())


##### Ordering your SQL records with ORDER BY
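* If you wanted to get all records from the Customer table of the Chinook database and order them in increasing order by the column SupportRepId, you could use the query SELECT * FROM Customer ORDER BY SupportRepId. The cell below applies the same idea to the Employee table, ordering its records by BirthDate.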

In [5]:
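# Import text() to wrap raw SQL strings, as SQLAlchemy 2.x requires
from sqlalchemy import text

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager
with engine.connect() as con:
    rs = con.execute(text('SELECT * FROM Employee ORDER BY BirthDate'))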

    df = pd.DataFrame(rs.fetchall())

    # Set the DataFrame's column names
    df.columns = rs.keys()

# Print head
print(df.head())

   EmployeeId  LastName FirstName                Title  ReportsTo  \
0           4      Park  Margaret  Sales Support Agent        2.0   
1           2   Edwards     Nancy        Sales Manager        1.0   
2           1     Adams    Andrew      General Manager        NaN   
3           5   Johnson     Steve  Sales Support Agent        2.0   
4           8  Callahan     Laura             IT Staff        6.0   

             BirthDate             HireDate              Address        City  \
0  1947-09-19 00:00:00  2003-05-03 00:00:00     683 10 Street SW     Calgary   
1  1958-12-08 00:00:00  2002-05-01 00:00:00         825 8 Ave SW     Calgary   
2  1962-02-18 00:00:00  2002-08-14 00:00:00  11120 Jasper Ave NW    Edmonton   
3  1965-03-03 00:00:00  2003-10-17 00:00:00         7727B 41 Ave     Calgary   
4  1968-01-09 00:00:00  2004-03-04 00:00:00          923 7 ST NW  Lethbridge   

  State Country PostalCode              Phone                Fax  \
0    AB  Canada    T2P 5G3  +1 (403)

### Querying relational databases directly with pandas
This section covers how to query relational databases directly using pandas. After creating a database engine, you can get the results of any particular query using 4 lines of code: connecting, executing a query, passing the results to a DataFrame, and naming the columns. However, you can do better by using the pandas function read_sql_query and passing it two arguments: the query you wish to make and the engine you want to connect to. This achieves the same result in a single line of code.
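A self-contained sketch of the one-line approach follows. To avoid depending on the Chinook file, it passes a standard-library sqlite3 connection, which read_sql_query also accepts, in place of a SQLAlchemy engine. The table contents are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory database with a small hypothetical Employee table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employee (EmployeeId INTEGER, LastName TEXT)")
con.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [(5, "Johnson"), (6, "Mitchell"), (8, "Callahan")],
)

# One call runs the query, fetches the rows, and names the columns
df = pd.read_sql_query(
    "SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY LastName", con
)
con.close()

print(df)
```

The column names come from the query result itself, so there is no separate step to assign them.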

##### Pandas and the Hello World of SQL Queries

In [6]:
# Import packages
from sqlalchemy import create_engine, text
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query('SELECT * FROM Album', engine)

# Print the head of DataFrame
print(df.head())

# Open engine in context manager and store query result in df1
# (text() wraps the raw SQL string, as SQLAlchemy 2.x requires)
with engine.connect() as con:
    rs = con.execute(text('SELECT * FROM Album'))
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()

# Confirm that both methods yield the same results
print(df.equals(df1))

   AlbumId                                  Title  ArtistId
0        1  For Those About To Rock We Salute You         1
1        2                      Balls to the Wall         2
2        3                      Restless and Wild         2
3        4                      Let There Be Rock         1
4        5                               Big Ones         3
True


##### Pandas for more complex querying

In [8]:
# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query('SELECT * FROM Employee WHERE EmployeeID >= 6 ORDER BY Birthdate', engine)

# Print the head of DataFrame
df.head()

Unnamed: 0,EmployeeId,LastName,FirstName,Title,ReportsTo,BirthDate,HireDate,Address,City,State,Country,PostalCode,Phone,Fax,Email
0,8,Callahan,Laura,IT Staff,6,1968-01-09 00:00:00,2004-03-04 00:00:00,923 7 ST NW,Lethbridge,AB,Canada,T1H 1Y8,+1 (403) 467-3351,+1 (403) 467-8772,laura@chinookcorp.com
1,7,King,Robert,IT Staff,6,1970-05-29 00:00:00,2004-01-02 00:00:00,590 Columbia Boulevard West,Lethbridge,AB,Canada,T1K 5N8,+1 (403) 456-9986,+1 (403) 456-8485,robert@chinookcorp.com
2,6,Mitchell,Michael,IT Manager,1,1973-07-01 00:00:00,2003-10-17 00:00:00,5827 Bowness Road NW,Calgary,AB,Canada,T3B 0C5,+1 (403) 246-9887,+1 (403) 246-9899,michael@chinookcorp.com


### Advanced Querying: Exploiting Table Relationships
This section explains the concept of table relationships in relational databases and how tables can be joined to obtain information from multiple tables. It specifically introduces the INNER JOIN operation and provides an example of using it to combine information from the "Orders" and "Customers" tables in the Northwind Traders database. It also mentions that dot notation is used to select columns from specific tables and that there are other types of JOIN operations.
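Here is a minimal INNER JOIN sketch using the standard-library sqlite3 module. The Orders and Customers tables below are hypothetical stand-ins with made-up columns and rows, not the actual Northwind schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical tables linked by CustomerID
con.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CompanyName TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
INSERT INTO Customers VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
INSERT INTO Orders VALUES (10, 1), (11, 2), (12, 1);
""")

# Dot notation (Table.Column) states which table each column comes from
rows = con.execute(
    "SELECT Orders.OrderID, Customers.CompanyName "
    "FROM Orders INNER JOIN Customers "
    "ON Orders.CustomerID = Customers.CustomerID "
    "ORDER BY Orders.OrderID"
).fetchall()
con.close()

print(rows)
```

The customer with no orders does not appear in the result, because an INNER JOIN keeps only rows that have a match in both tables.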

##### The power of SQL lies in relationships between tables: INNER JOIN

In [9]:
# Import text() to wrap raw SQL strings, as SQLAlchemy 2.x requires
from sqlalchemy import text

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute(text("SELECT Title, Name FROM Album INNER JOIN Artist ON Album.ArtistID = Artist.ArtistID"))
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print head of DataFrame df
print(df.head())

                                   Title       Name
0  For Those About To Rock We Salute You      AC/DC
1                      Balls to the Wall     Accept
2                      Restless and Wild     Accept
3                      Let There Be Rock      AC/DC
4                               Big Ones  Aerosmith


##### Filtering your INNER JOIN

In [12]:
# Execute query and store records in DataFrame: df
df = pd.read_sql_query(
    "SELECT * FROM PlaylistTrack INNER JOIN Track ON PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000",
    engine
)

# Print head of DataFrame
df.head()

Unnamed: 0,PlaylistId,TrackId,TrackId.1,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
0,1,3390,3390,One and the Same,271,2,23,,217732,3559040,0.99
1,1,3392,3392,Until We Fall,271,2,23,,230758,3766605,0.99
2,1,3393,3393,Original Fire,271,2,23,,218916,3577821,0.99
3,1,3394,3394,Broken City,271,2,23,,228366,3728955,0.99
4,1,3395,3395,Somedays,271,2,23,,213831,3497176,0.99


### Final Thoughts
You have learned to create database engines and connect to databases from Python, perform SELECT queries, use WHERE to filter results, perform JOIN queries, and store results in pandas DataFrames. The article suggests that the reader can now import all types of files in Python!