#PYTHON FOR DATA SCIENCE, AI and MACHINE LEARNING#

pandas is a library for the Python programming language used for data science. It is primarily a data analysis and manipulation library, built on top of NumPy, that offers a great deal of functionality.
To use pandas in Python we must first import the library.
Typing "pandas" in full every time gets tedious, so by convention we import it under
a shorter alias, "pd", as seen below.

In [1]:
import pandas as pd # imports the pandas library under the alias "pd".

Now that we have imported the pandas library we have a number of groovy pre-made objects and classes available to us, with a range of VERY swanky methods we can use for extracting that sweet, sweet data. Using pandas we can do a lot of very convenient things, for example opening CSV files and accessing their information, creating DataFrames for data analysis, and more.

Some of the really groovy classes available in pandas include:
    1) Series objects: essentially a more flexible one-dimensional data array, based on the NumPy array object.
    2) DataFrames: pandas' way of representing data tables or structured data.

At the very basic level, pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. For example:
My_S1 = pd.Series([1,2,3,4,5])

Once you have created a Series object you can use its attributes, such as values and index, to view its values or indices.

In [12]:
My_S1 = pd.Series([0.89,1.33,2.71,3.41,4.67]) #creates a pandas Series object from a Python list (a NumPy array works too).
print(My_S1)
print("The values of My_S1 =", My_S1.values)
print("The index of My_S1 is", My_S1.index)


0    0.89
1    1.33
2    2.71
3    3.41
4    4.67
dtype: float64
The values of My_S1 = [0.89 1.33 2.71 3.41 4.67]
The index of My_S1 is RangeIndex(start=0, stop=5, step=1)


As you can see from the output above, printing the Series object shows the values stored in it alongside their respective indices. We can also get at the contents of a Series through attributes like values and index, and by index position (just as we would for a regular Python list). We can slice and select sequences of data from our Series using the regular indexing syntax series[0:...] to identify a range.
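As a small sketch of the slicing described above, here is the same list of values selected by position. The names s, first and middle are just illustrative, and .iloc is used to make it explicit that we are selecting by position rather than by label.

```python
import pandas as pd

# Same values as My_S1 above.
s = pd.Series([0.89, 1.33, 2.71, 3.41, 4.67])

first = s.iloc[0]      # a single value, selected by position
middle = s.iloc[1:4]   # positions 1, 2 and 3 (the stop position is excluded)

print(first)
print(middle)
```

Just like a list slice, the positional slice stops one short of the end position.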



We can then use these series to construct a table or DataFrame in pandas (more on that later).

Another cool thing that Series objects can do is redefine the index. That is to say, while the NumPy array has an implicitly defined integer index used to access the values, the pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. If we wish, we can use strings as an index.
This makes a Series a bit more flexible and usable than a NumPy array. Having indices that correspond to a name, or given ID, can be quite useful for constructing data objects that have a LONG LIST OF VALUES IN THEM. For example, if we think of a Series or an array as representing an object, the value at index 0 might be that object's AGE, the value at index 1 might be some other information about the object, for example SEX, and so on until we have a full array or Series that reflects some data about a given entity. With the NumPy array we are stuck with integer indices 0,1,2,3,4,5... whereas with Series objects we can give each index a name or more easily referable title to call on when we need some information.

In [20]:
My_S2 = pd.Series([1, 2, 3, 4, 5], index = ["ichi","ni", "san", "yon", "go"]) #creates a series object with labelled indices.  
print(My_S2)
print("The values of My_S2 are =", My_S2.values)
print("The index of My_S2 is", My_S2.index)
print(My_S2["ichi":"san"])


ichi    1
ni      2
san     3
yon     4
go      5
dtype: int64
The values of My_S2 are = [1 2 3 4 5]
The index of My_S2 is Index(['ichi', 'ni', 'san', 'yon', 'go'], dtype='object')
ichi    1
ni      2
san     3
dtype: int64
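One thing worth noticing in the output above is that the slice "ichi":"san" includes "san". Slicing by label includes the end label, whereas slicing by position leaves the end position out. A minimal sketch of the difference, rebuilding the same Series under the illustrative name s:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["ichi", "ni", "san", "yon", "go"])

by_label = s.loc["ichi":"san"]   # label slice: the end label "san" IS included
by_position = s.iloc[0:2]        # positional slice: position 2 is NOT included

print(by_label)
print(by_position)
```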


One way to look at the pandas Series object is as a bit like a specialization of a Python dictionary. A regular dictionary maps arbitrary keys to arbitrary values (arbitrary meaning that the values can be of literally any type you want, and the keys of any hashable type), while a Series is a structure which maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

What this also means is that we can construct a Series directly from a Python dictionary, simply by passing the dictionary as the argument: >pd.Series(dictionary).


In [35]:
My_D1 = {"Boy":"Roger", "Girl":"Lyra", "Place":"Oxford", "Animal":"Pine Marten", "Thing":"Alethiometer"} #defines dictionary
My_S3 = pd.Series(My_D1) #creates series from dictionary
print(My_D1.values())
print(My_S3)
print(My_S3["Place"])
print(My_S3["Girl":"Thing"]) # creates a slice from the series

dict_values(['Roger', 'Lyra', 'Oxford', 'Pine Marten', 'Alethiometer'])
Boy              Roger
Girl              Lyra
Place           Oxford
Animal     Pine Marten
Thing     Alethiometer
dtype: object
Oxford
Girl              Lyra
Place           Oxford
Animal     Pine Marten
Thing     Alethiometer
dtype: object


DataFrames are basically two-dimensional variations of Series. Where a Series has a single index whose labels can be named or titled, the DataFrame extends this capability to both rows and columns. In other words, we can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.
Look at the example below.


In [42]:
area_pop ={'California': 38332521,'Texas': 26448193,'New York': 19651127,'Florida': 19552860,'Illinois': 12882135}
pop = pd.Series(area_pop)
print(pop, "\n")
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995} #constructs a dictionary of 5 states, and their areas.
area = pd.Series(area_dict) #constructs a series using dictionary
print(area)

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64 

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64


Now that we have two Series objects (pop and area) we can use a dictionary to construct a single two-dimensional object containing this information. This single two dimensional object will be our DataFrame.


In [44]:
states = pd.DataFrame({'population': pop,'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


What we have done above is take two Series that share the same index (the state names) and stack them side by side, with appropriate titles to describe what values they represent. This means that the DataFrame object has a regular index attribute for the rows AND another Index object attribute for the columns. This gives the data in our DataFrame two points of reference which we can use to locate and access data. These attributes can be accessed using DataFrame.index and DataFrame.columns respectively. Between them they give us a running list of the Series (the columns) and shared indices (the rows) which make up the DataFrame.

Similarly, we can access the data stored in them through their respective index and column key/title, as you can see below.


In [61]:
print("The index of states are ", states.index)
print("The columns of states are", states.columns)
print(states.index[0:],"\n\n", states.columns[0:])
print(states['population'],"\n\n", states['area'])

The index of states are  Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
The columns of states are Index(['population', 'area'], dtype='object')
Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object') 

 Index(['population', 'area'], dtype='object')
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64 

 California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
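The two points of reference (row label and column label) can also be used together. Here is a minimal sketch using .loc on a cut-down, two-state version of the states DataFrame; texas_row and texas_area are just illustrative names.

```python
import pandas as pd

# Two of the states from the example above, to keep things short.
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})
states = pd.DataFrame({'population': pop, 'area': area})

texas_row = states.loc['Texas']            # one row, returned as a Series
texas_area = states.loc['Texas', 'area']   # one cell: row label, then column label

print(texas_row)
print(texas_area)
```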


A Pandas DataFrame can be constructed in a variety of ways. Here we'll give several examples.

From a single Series object:
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series using the syntax > pd.DataFrame(Series_name, columns = ["column_name"]), where the column name is whatever title you want to assign to the values. Note that columns takes a list, even for a single name.
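A quick sketch of the single-Series case, assuming the Series has not been given a name of its own. The to_frame method is shown alongside as an alternative way of getting the same result.

```python
import pandas as pd

# An unnamed Series, using two of the state populations from earlier.
pop = pd.Series({'California': 38332521, 'Texas': 26448193})

df = pd.DataFrame(pop, columns=['population'])   # columns is a list of names
df_alt = pop.to_frame(name='population')         # alternative: Series.to_frame

print(df)
```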

From a list of dictionary objects:
Any list of dictionaries can be made into a DataFrame. Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "not a number") values. We can use the syntax >pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]) and this will create a DataFrame where a, b and c are columns and the values associated with them are stored accordingly.
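Running the syntax above as-is shows the NaN filling in action. One side effect to be aware of is that a column containing NaN is stored as floating point, which is why a and c end up as floats here.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

print(df)          # 'a' is missing from the second row, 'c' from the first
print(df.dtypes)   # 'a' and 'c' are float64 because they hold NaN
```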

From a dictionary of Series objects:
As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well. For this we can use the syntax >pd.DataFrame({"series1title": series1, "series2title": series2}) where we construct a dictionary pairing each Series with an appropriate title which defines what data the Series represents.

From a two-dimensional NumPy array:
Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If either is omitted, an integer index is used in its place.
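A minimal sketch of the two-dimensional array case. The array contents and the names foo, bar, a, b and c are made up purely for illustration.

```python
import numpy as np
import pandas as pd

data = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

# With explicit column and index names.
df = pd.DataFrame(data, columns=['foo', 'bar'], index=['a', 'b', 'c'])

# With both omitted: rows and columns fall back to 0, 1, 2, ...
df_default = pd.DataFrame(data)

print(df)
print(df_default)
```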

From a NumPy Structured Array:
A pandas DataFrame operates much like a structured array, and can be created directly from one.
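A minimal sketch of the structured array case. The field names A and B and their types are made up for illustration; the field names of the structured array become the column names of the DataFrame.

```python
import numpy as np
import pandas as pd

# A structured array with an integer field 'A' and a float field 'B'.
arr = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

df = pd.DataFrame(arr)

print(df)
```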

##Opening Files in pandas##
pandas is pretty rad in that it is usable with a wide range of structured data formats. Naturally it can open CSV files, but it can also open JSON, SQL, Excel (xls), and a bunch of other formats used for storing tabular and structured data. The method for opening a particular file type is very simple: each format has its own read function, such as pd.read_csv, pd.read_json, pd.read_excel and pd.read_sql.
> variable = pd.read_format("c://blahblahblah/blahblah/blah/filepath")
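To try the read functions without needing a file on disk, here is a small sketch that reads CSV text from memory. The use of io.StringIO is an assumption made only to keep the example self-contained; with a real file you would pass the file path instead. The two data rows are copied from the precipitation data used in these notes.

```python
import io
import pandas as pd

# CSV content held in a string, standing in for a file on disk.
csv_text = (
    "Country or Area,Year,Value\n"
    "Albania,2007,30964.0\n"
    "Albania,2006,32380.0\n"
)

df = pd.read_csv(io.StringIO(csv_text))   # read_csv accepts a path or a file-like object

print(df)
```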



In [67]:
precipitation = pd.read_csv('C:/Users/Shael/Documents/IBM Data Science Professional Certificate Specialization/SAved Resources/precipitation.csv')
precipitation.head()

  Country or Area  Year    Value  Value Footnotes                  Unit
0         Albania  2007  30964.0              NaN  million cubic metres
1         Albania  2006  32380.0              NaN  million cubic metres
2         Albania  2005  42840.0              NaN  million cubic metres
3         Albania  2004  42787.0              NaN  million cubic metres
4         Albania  2003  27893.0              NaN  million cubic metres


One very cool thing you can do with DataFrame objects is create new DataFrame objects out of their constituent Series. To do this we can assign a variable >NewFrame = DataFrame[["column1","column2","column3"]]

In [69]:
PrecipDataFrame = precipitation[["Country or Area","Year","Value"]] # creates a new Dataframe with selected columns
PrecipDataFrame.head()

  Country or Area  Year    Value
0         Albania  2007  30964.0
1         Albania  2006  32380.0
2         Albania  2005  42840.0
3         Albania  2004  42787.0
4         Albania  2003  27893.0


Some of the really cool methods that pandas offers include the unique() method, which returns only the unique elements of a given column in a DataFrame and ignores duplicates.
The syntax is regular Python syntax: > unique_vals = DataFrame["Columnname"].unique()

In [70]:
Countries = PrecipDataFrame["Country or Area"].unique()
for c in Countries:
    print(c)

Albania
Algeria
Andorra
Anguilla
Antigua and Barbuda
Armenia
Azerbaijan
Bahrain
Barbados
Belarus
Belgium
Belize
Benin
Bermuda
Bosnia and Herzegovina
Botswana
British Virgin Islands
Brunei Darussalam
Cameroon
Central African Republic
Chile
China
China, Hong Kong SAR
China, Macao SAR
Colombia
Cote d'Ivoire
Croatia
Cuba
Cyprus
Czech Republic
Denmark
Dominican Republic
Ecuador
Egypt
Estonia
Finland
France
Gambia
Georgia
Germany
Guinea
Hungary
India
Iraq
Israel
Italy
Jamaica
Jordan
Kazakhstan
Kuwait
Kyrgyzstan
Latvia
Lebanon
Lithuania
Luxembourg
Madagascar
Maldives
Malta
Marshall Islands
Mauritius
Monaco
Morocco
Netherlands
Oman
Panama
Paraguay
Poland
Portugal
Qatar
Republic of Moldova
Romania
Senegal
Serbia
Singapore
Slovakia
Slovenia
South Africa
Spain
Sri Lanka
Sweden
Switzerland
Syrian Arab Republic
The Former Yugoslav Rep. of  Macedonia
Togo
Trinidad and Tobago
Tunisia
Turkey
United Kingdom
Venezuela
Yemen
Zimbabwe


We can also select specific rows on the basis of some common value stored in the columns.
For example, say we wanted to create a DataFrame just for the year 2009, and exclude the other values.
We can write a single line of code that 1) selects only those rows that feature a "Year" value equal to 2009 and 2) assigns the selected rows to a new variable as a DataFrame.

In [73]:
Precip2009 = PrecipDataFrame[PrecipDataFrame["Year"]==2009] #creates a variable containing precipitation data for year 2009
print(Precip2009.head())

PrecipBarbados = precipitation[precipitation["Country or Area"]=="Barbados"] #creates variable containing precipitation data for Barbados
print(PrecipBarbados)

    Country or Area  Year       Value  Value Footnotes                  Unit
108        Barbados  1996  656.179993              NaN  million cubic metres
109        Barbados  1995  573.190002              NaN  million cubic metres
110        Barbados  1990  705.630005              NaN  million cubic metres


Now we can save our new DataFrames using the syntax > DataFrame.to_csv("newFilename.csv")
Of course, as pandas can also export DataFrames in a number of other formats, it is possible to save them as such as well. Most of them follow the method format >to_format(), for example to_json() or to_excel().

In [75]:
Precip2009.to_csv("Precipitation_2009.csv")
PrecipBarbados.to_csv("Precipitation_Barbados.csv")
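One detail worth knowing about to_csv is that it writes the row index as an extra first column by default, which comes back as a column called "Unnamed: 0" when the file is read again. Passing index=False leaves it out. The sketch below uses an in-memory buffer (io.StringIO) instead of a real file, purely to stay self-contained, and a cut-down two-row version of the Barbados data.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {"Country or Area": ["Barbados", "Barbados"], "Year": [1996, 1995]},
    index=[108, 109],
)

# Default: the index is written as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
with_index = pd.read_csv(io.StringIO(buf.getvalue()))

# index=False: only the data columns are written.
buf = io.StringIO()
df.to_csv(buf, index=False)
without_index = pd.read_csv(io.StringIO(buf.getvalue()))

print(list(with_index.columns))
print(list(without_index.columns))
```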