# Data Analysis with Python

---

Data Analysis/Data Science helps us answer questions from data.

**Data Analysis** plays an important role in:
* Discovering useful information
* Answering questions
* Predicting the future or the unknown


### Python Packages for Data Science


A Python package is a collection of **functions** and **methods** that let you perform many tasks without writing the underlying code yourself.

The libraries usually contain built-in modules providing different functionalities which you can use directly.


#### Three groups of Python data analysis libraries:

1.	**Scientific Computing Libraries**
    a.	**Pandas** → data structures and tools (provides fast access to structured data). The primary instrument of Pandas is a two-dimensional table with column and row labels, called a data frame. It is designed to provide easy indexing functionality.
    b.	**NumPy** → uses arrays for its inputs and outputs. It can be extended to objects for matrices, and with minor coding changes developers can perform fast array processing.
    c.	**SciPy** → functions for some advanced math problems, as well as data visualization.

2.	**Visualization Libraries** let you create graphs, charts and maps, which are the best way to communicate the meaningful results of an analysis to others.
    a.	**Matplotlib** (the most well-known) → great for making graphs and plots
    b.	**Seaborn** → based on Matplotlib. It makes it easy to generate various plots such as heat maps, time series and violin plots.

3.	**Algorithmic Libraries** (machine learning algorithms): to develop a model using our dataset and obtain predictions
    a.	**Scikit-Learn** → contains tools for statistical modeling, including regression, classification, clustering, … It is built on NumPy, SciPy and Matplotlib.
    b.	**Statsmodels** → to explore data, estimate statistical models and perform statistical tests.
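
As a rough illustration of how two of these libraries differ in practice, the sketch below stores the same made-up numbers in a NumPy array and in a Pandas data frame. The values, column names and row labels are hypothetical and chosen only for the example.

```python
import numpy as np
import pandas as pd

# Made-up measurements for illustration: two rows, three columns.
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# NumPy operates on the whole array at once, with no explicit loop.
column_means = values.mean(axis=0)

# Pandas wraps the same data in a labeled, two-dimensional table.
df_demo = pd.DataFrame(values, columns=["a", "b", "c"], index=["row1", "row2"])

# Labels allow lookups by name rather than by position.
cell = df_demo.loc["row2", "b"]

print(column_means)
print(df_demo)
print(cell)
```

The array is addressed by position, while the data frame can be addressed by its row and column labels, which is the indexing convenience mentioned above.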




### Importing and Exporting Data in Python

How do we read data using the Python Pandas package?

Data acquisition is the process of loading and reading data into a notebook from various sources.
To read any data using Python's Pandas package, there are two important factors to consider: _**format**_ and _**file path**_.

**Format** is the way data is encoded.
Some common formats: .csv, .json, .xlsx, .hdf, …

**Path** tells us where the data is stored.
Usually, it is stored either on the computer we are using or online on the internet.

In a tabular file, each row is one datapoint.
If the properties within a row are separated from each other by commas, we can guess the data format is CSV, which stands for comma-separated values.


In Pandas, the `read_csv` function reads files with comma-separated columns into a Pandas data frame.


Reading data in Pandas can be done quickly in 3 lines:
1.	Import Pandas
2.	Define a variable with a file path
3.	Use the `read_csv` function to import the data


In [3]:
import pandas as pd
url = "https://cdn.wsform.com/wp-content/uploads/2020/06/industry.csv"
df = pd.read_csv(url)

`read_csv` assumes the data has a header row.
If our data contains no header, we need to tell `read_csv` not to assign headers by setting `header` to `None`.

In [None]:
# df = pd.read_csv(url, header=None)
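
To make the effect of `header=None` concrete, here is a minimal sketch that reads the same headerless text twice. The CSV content is hypothetical and is held in memory with `io.StringIO`, so no file or network access is needed.

```python
import io
import pandas as pd

# Hypothetical headerless CSV content, held in memory instead of a file.
raw = "Accounting/Finance\nAutomotive\nTechnology\n"

# Default behaviour: the first line is consumed as the column header.
with_header = pd.read_csv(io.StringIO(raw))

# header=None: every line is data, and columns get integer labels 0, 1, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_header.shape)
print(no_header.shape)
```

With the default setting the first record is silently lost to the header, which is why it is worth checking the row count after loading.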

After reading the dataset, it is a good idea to look at the data frame to get a better intuition and to ensure that everything occurred the way you expected. Since printing the entire dataset may take up too much time and resources, we can use `dataframe.head(n)` to show only the first n rows of the data frame.

In [4]:
df.head(5) # show the first 5 rows of data frame

|   | Industry |
|---|----------|
| 0 | Accounting/Finance |
| 1 | Advertising/Public Relations |
| 2 | Aerospace/Aviation |
| 3 | Arts/Entertainment/Publishing |
| 4 | Automotive |


Similarly, ```dataframe.tail(n)``` shows the bottom n rows of the data frame.

In [5]:
df.tail(5)

|    | Industry |
|----|----------|
| 38 | Skilled Labor |
| 39 | Technology |
| 40 | Telecommunications |
| 41 | Transportation/Logistics |
| 42 | Other |


We can assign column names in Pandas:
1.	Put the column names in a list called headers.
2.	Set df.columns equal to headers to replace the existing headers (for example, the default integer headers) with the list.


In [8]:
headers = ['Industry Name']
df.columns = headers
df.head(5)

|   | Industry Name |
|---|---------------|
| 0 | Accounting/Finance |
| 1 | Advertising/Public Relations |
| 2 | Aerospace/Aviation |
| 3 | Arts/Entertainment/Publishing |
| 4 | Automotive |


At some point, after you've done operations on your data frame, you may want to export it to a new CSV file. You can do this using the `to_csv` method. To do this, specify the file path, including the file name, that you want to write to. For example, to save data frame df as automobile.csv on your own computer, call `df.to_csv` with that path.

In [9]:
path = "./automobile.csv"
df.to_csv(path)
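
One detail worth knowing when exporting: by default `to_csv` also writes the row index as an extra first column. The sketch below uses a small hypothetical data frame and writes to a string instead of a file (calling `to_csv` without a path returns the CSV text) to show the difference that `index=False` makes.

```python
import io
import pandas as pd

# Small hypothetical frame used only for this illustration.
df_demo = pd.DataFrame({"Industry Name": ["Accounting/Finance", "Automotive"]})

# Default: the row index is written as an extra, unnamed first column.
with_index = df_demo.to_csv()

# index=False: only the data columns are written.
without_index = df_demo.to_csv(index=False)

# Reading the first version back yields an extra "Unnamed: 0" column.
round_trip = pd.read_csv(io.StringIO(with_index))

print(with_index)
print(without_index)
print(list(round_trip.columns))
```

If the index carries no information, passing `index=False` keeps the exported file free of that extra column.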

Pandas also supports importing and exporting most other data file types, with a matching read function and save method for each format.

| Data Format | Read | Save |
|----------|-----------------|--------------|
| csv      | pd.read_csv()   | df.to_csv()  |
| json     | pd.read_json()  | df.to_json() |
| Excel    | pd.read_excel() | df.to_excel()|
| sql      | pd.read_sql()   | df.to_sql()  |




### Getting Started Analyzing in Python

Understand your data before you begin.

You should check:
* Data Types
* Data Distribution


Locate potential issues with the data.

Pandas has several built-in methods that can be used to understand the data type of features or to look at the distribution of data within the dataset.

Using these methods gives an overview of the dataset and points out potential issues, such as a feature having the wrong data type, which may need to be resolved later.

Data has a variety of types. The main types stored in **Pandas** objects are **object**, **float**, **int**, and **datetime**.
The data type names are somewhat different from those in native Python.


| Pandas Type | Native Python Type | Description |
|-------------|--------------------|---------------------------------|
| object      | string             | numbers and strings             |
| int64       | int                | whole numbers                   |
| float64     | float              | numbers with decimals           |
| datetime64, timedelta[ns] | N/A (but see the datetime module in Python's standard library) | time data |



The datetime Pandas type is very useful for handling time series data.

Pandas automatically assigns types based on the encoding it detects from the original data table. For several reasons, this assignment may be incorrect. There are two reasons to check data types in a dataset:

1.	Potential info and type mismatch (for example, a column of numbers stored as object)
2.	Compatibility with Python methods (for example, some math functions can only be applied to numeric data)


When the `dtypes` attribute is accessed on the data set, the data type of each column is returned in a Series.
In Pandas, we use ```dataframe.dtypes``` to check data types.
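
As a minimal sketch of this check, assume a small hypothetical data frame in which a numeric column was loaded as text. Inspecting `dtypes` exposes the mismatch, and `astype` is one way to correct it.

```python
import pandas as pd

# Hypothetical data: "price" holds numbers that were loaded as text,
# so Pandas does not treat the column as numeric.
df_demo = pd.DataFrame({
    "industry": ["Automotive", "Technology"],
    "employees": [120, 45],
    "price": ["10.5", "7.25"],
})

# One dtype per column, returned as a Series.
print(df_demo.dtypes)

# Convert the mistyped column so numeric methods work on it.
df_demo["price"] = df_demo["price"].astype("float64")
print(df_demo.dtypes)

total_price = df_demo["price"].sum()
print(total_price)
```

Before the conversion, summing the column would concatenate the strings instead of adding the numbers, which is the kind of incompatibility the second reason above refers to.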

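To check the distribution of the data, `dataframe.describe()` returns a statistical summary of each column.

In [10]:
# Returns a statistical summary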
df.describe()

|        | Industry Name |
|--------|---------------|
| count  | 43 |
| unique | 43 |
| top    | Accounting/Finance |
| freq   | 1 |

By default, `describe` summarizes only the numeric columns when the data frame contains any. To get a summary of all the columns, we can add an argument: `df.describe(include="all")`.

The outcome then shows the summary of all the columns, including object-typed attributes. For object-typed columns, a different set of statistics is evaluated: unique, top, and freq. Unique is the number of distinct objects in the column, top is the most frequently occurring object, and freq is the number of times the top object appears in the column.


In [11]:
df.describe(include="all")

|        | Industry Name |
|--------|---------------|
| count  | 43 |
| unique | 43 |
| top    | Accounting/Finance |
| freq   | 1 |
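
The data frame above has a single object-typed column, so both calls produce the same summary. A minimal sketch with hypothetical mixed data shows where `include="all"` makes a difference:

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
df_demo = pd.DataFrame({
    "industry": ["Automotive", "Technology", "Automotive"],
    "employees": [120, 45, 80],
})

# Default: only the numeric column is summarized (count, mean, std, ...).
numeric_summary = df_demo.describe()

# include="all": text columns are added, with unique, top and freq rows.
full_summary = df_demo.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the combined summary, statistics that do not apply to a column (such as mean for text) appear as NaN.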


### Accessing Databases with Python

DBMS → Database Management Systems are software systems used to store, retrieve, and run queries on data. A DBMS serves as an interface between an end-user and a database, allowing users to create, read, update, and delete data in the database.


There is a mechanism by which the Python program communicates with the DBMS.

The Python code connects to the database using API calls.

**SQL APIs and Python DB APIs**.

An application programming interface is a set of functions that you can call to get access to some type of service.

The SQL API consists of library function calls that serve as an application programming interface (API) for the DBMS.
To pass SQL statements to the DBMS, an application program calls functions in the API, and it calls other functions to retrieve query results and status information from the DBMS.

The basic operation of a typical SQL API is illustrated in the figure.
![What is a SQL API?](./images/SQL-API.png)
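
The call pattern described above can be sketched with Python's built-in `sqlite3` module, which follows the Python DB-API. This is only an illustration: it uses an in-memory SQLite database, and the table name and rows are hypothetical. Other database systems need their own driver and connection details.

```python
import sqlite3

# Connect to an in-memory SQLite database (nothing is written to disk).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Pass SQL statements to the DBMS through API calls.
cur.execute("CREATE TABLE industry (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany(
    "INSERT INTO industry (name) VALUES (?)",
    [("Accounting/Finance",), ("Automotive",)],
)
conn.commit()

# Retrieve query results through further API calls.
cur.execute("SELECT id, name FROM industry ORDER BY id")
rows = cur.fetchall()
print(rows)

# Release the connection when finished.
conn.close()
```

Each step maps onto the description: connect, send statements with `execute`, fetch results with `fetchall`, and close the connection.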




