# Pandas - Series

In [1]:
import numpy as np
import pandas as pd

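### Creating straight from an ndarray
* if no index is passed, a default integer index (0, 1, 2, ...) is created
* if an index is passed, it must be the same length as the data
* scalar data must be given an index; the value is repeated to match the index length
* a Series works like an ndarray: it supports similar slicing and can be passed to most NumPy functions
* use Series.to_numpy() to convert to an actual ndarray
* can be named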

In [13]:
s = pd.Series(np.random.randn(5), index=[1,2,3,4,5], name='myseries')

In [12]:
s

1   -1.347644
2    1.486419
3   -2.639468
4   -0.137392
5   -0.752758
dtype: float64

In [14]:
s.name

'myseries'

In [9]:
s.index

Int64Index([1, 2, 3, 4, 5], dtype='int64')
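A minimal sketch of the Series creation notes above. The names `s2`, `default_idx` and `scalar` are made up for illustration, and `np.arange` is used instead of random data so the values are predictable.

```python
import numpy as np
import pandas as pd

# explicit index: must have the same length as the data
s2 = pd.Series(np.arange(5.0), index=[1, 2, 3, 4, 5], name='myseries')

# no index passed: pandas creates 0, 1, 2, ...
default_idx = pd.Series(np.arange(3))

# scalar data: the value is repeated for every index label
scalar = pd.Series(7.0, index=['a', 'b', 'c'])

head = s2.iloc[:2]     # positional slicing, like an ndarray
roots = np.sqrt(s2)    # NumPy functions return a Series with the same index
arr = s2.to_numpy()    # plain ndarray, index dropped
```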

### Creating from a dict
* values can be looked up by key, like a dictionary

In [10]:
d = {'b': 1, 'a': 0, 'c':2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64
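A short sketch of the dictionary-like access mentioned above, reusing the same dict. The variable names are illustrative only.

```python
import pandas as pd

s_dict = pd.Series({'b': 1, 'a': 0, 'c': 2})

first = s_dict['a']        # look up a value by label, like a dict
has_b = 'b' in s_dict      # membership tests check the index labels
missing = s_dict.get('z')  # .get returns None instead of raising KeyError
```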

### Vectorized operations
* Series can be added, multiplied, and passed to NumPy functions.
* **MAIN difference between Series and ndarray: operations between Series automatically align the data on the label, so the operands don't need to have the same labels**
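To make the label alignment concrete, here is a small sketch with two Series whose labels only partly overlap. The Series `a` and `b` and their values are invented for the example.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, 30.0], index=['y', 'z', 'w'])

# values are matched by label, not by position;
# labels present in only one operand come out as NaN
total = a + b

# NumPy functions work element-wise and keep the index
exp_a = np.exp(a)
```

Only `y` and `z` exist in both operands, so only those two labels get a numeric result.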

# Pandas - Data Frame

### From series or dictionary
* creates columns from the keys, and aligns each Series on the union of their indexes
* where a Series has no value for an index label, the cell is filled with NaN (missing)

In [15]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
         'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [16]:
df = pd.DataFrame(d)

In [17]:
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [18]:
pd.DataFrame(d, index=['a','b'])

   one  two
a  1.0  1.0
b  2.0  2.0


### From Dictionary of ndarrays or lists
* the index is created automatically as incrementing integers
* pass index to set the row labels: pd.DataFrame(e, index=['a', 'b', 'c', 'd'])


In [20]:
e = {'one': [1., 2., 3., 4.],
         'two': [4., 3., 2., 1.]}

In [21]:
pd.DataFrame(e)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


### From a series

In [None]:
pd.DataFrame(pd.Series(np.random.randn(5), name='something'))

## Column selection, addition, deletion
* a df is like a dictionary of like-indexed Series, and column access works the same as for a dictionary
* e.g. 
    * df['one']
    * df['three'] = df['one'] * df['two']
    * del df['two']
    * df['foo'] = 'bar'
        * populates every row of column foo with 'bar'
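The column operations listed above, run end to end on a small frame. The data values are made up; the column names follow the notes.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]},
                  index=['a', 'b', 'c'])

col = df['one']                      # select a column, returns a Series
df['three'] = df['one'] * df['two']  # add a column computed from others
df['foo'] = 'bar'                    # scalar is broadcast to every row
del df['two']                        # delete a column, like a dict key
```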

In [None]:
# if the index of an assigned Series differs, it is conformed to the DataFrame's index
# df['one_trunc'] = df['one'][:2]
# keeps the first two values and fills the rest of one_trunc with NaN

In [None]:
# arithmetic applies element-wise to a whole df
# e.g. 1/df, df ** 4, etc.
# boolean operators (&, |, ^, ~) are vectorized too - build the frames with dtype=bool so 0,1,1 becomes False,True,True
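A runnable sketch of the two commented cells above. The frame contents and the `flags` frame are invented for illustration, and `.iloc[:2]` is used as an explicit positional equivalent of `[:2]`.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 4.], 'two': [4., 2., 1.]},
                  index=['a', 'b', 'c'])

# the assigned Series only has labels 'a' and 'b';
# it is conformed to df's index, so row 'c' becomes NaN
df['one_trunc'] = df['one'].iloc[:2]

# arithmetic on a whole DataFrame is element-wise
inv = 1 / df[['one', 'two']]
pow4 = df[['one', 'two']] ** 4

# dtype=bool turns 0/1 into False/True; & is applied element-wise
flags = pd.DataFrame({'p': [0, 1, 1], 'q': [1, 1, 0]}, dtype=bool)
both = flags['p'] & flags['q']
```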

## dtypes
* categorical, integer, strings, boolean, any (object)
* object catch all for everything 
* df.dtypes for all columns/Series
* df['a'].dtype # for a specific column
* strings can be object or string
* .astype() - changes the type of columns

* changing multiple column types at once:
    * dft1 = dft1.astype({'a': bool, 'c': np.float64})
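A small sketch of inspecting and changing dtypes. The frame `dft` and its contents are assumptions for the example. The builtin `bool` is used because the `np.bool` alias is not available in some NumPy releases.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({'a': [1, 0, 1], 'b': ['x', 'y', 'z'], 'c': [1, 2, 3]})

all_dtypes = dft.dtypes    # one dtype per column
one_dtype = dft['a'].dtype # dtype of a single column

# change several column types at once with a {column: dtype} mapping
dft = dft.astype({'a': bool, 'c': np.float64})
```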

# Pandas Basics

* turning columns to lower: df.columns = [x.lower() for x in df.columns]
* .shape, .index, .columns
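* objects are either Index, Series, or DataFrame
* to access the contents of a column: df.a.array (or df['a'].to_numpy() for an ndarray)
* s.value_counts() counts occurrences of each value in a 1D array (a histogram); s.mode() returns the most frequent value(s)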
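The basics above in one short sketch. The frame and its values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'x'], 'B': [1, 2, 3]})

# lower-case all column names
df.columns = [c.lower() for c in df.columns]

shape = df.shape                 # (rows, columns)
counts = df['a'].value_counts()  # how often each value occurs
most_common = df['a'].mode()     # most frequent value(s), as a Series
values = df['a'].array           # underlying array of the column
```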


#### Altering labels
* df.reindex(['a', 'b', 'c']) - reorders the data to match the new index, and puts missing (NaN) where a label was not in the original
* to share an index with another object: rs = s.reindex(df.index)
* used for aligning data 
* for rows and columns at the same time: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
* or pass the labels with the axis keyword, 'index' or 'columns'
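A sketch of the reindex variants above. The frame and Series contents are made up; the labels follow the notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.],
                   'three': [7., 8., 9.]}, index=['a', 'b', 'c'])

rows = df.reindex(['c', 'f', 'b'])  # 'f' is new, so its row is all NaN
both = df.reindex(index=['c', 'f', 'b'],
                  columns=['three', 'two', 'one'])
cols = df.reindex(['two', 'one'], axis='columns')

# align a Series to the DataFrame's index
s = pd.Series([10, 20], index=['a', 'z'])
rs = s.reindex(df.index)
```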

#### Dropping labels from an axis
* .drop(['a'], axis=0) drops rows, .drop(['a'], axis=1) drops columns (0 = row, 1 = column)

#### Renaming
* df.rename(columns={'one': 'foo', 'two': 'bar'},index={'a': 'apple', 'b': 'banana', 'd': 'durian'})
* use a dictionary mapping old name to new name 
* or pass a mapping or function together with axis="columns" or axis="index" 
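Dropping and renaming labels, sketched on a small frame. The data is invented; the label names come from the notes.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]},
                  index=['a', 'b', 'd'])

no_row = df.drop(['a'], axis=0)    # drop a row label
no_col = df.drop(['two'], axis=1)  # drop a column label

# rename with old-name -> new-name dictionaries
renamed = df.rename(columns={'one': 'foo', 'two': 'bar'},
                    index={'a': 'apple', 'b': 'banana', 'd': 'durian'})

# or apply a function to one axis
upper = df.rename(str.upper, axis='columns')
```

Both `drop` and `rename` return new objects here; `df` itself is unchanged.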

#### .dt and str accessors
* .dt accesses datetime properties of a datetime-like Series, e.g. s.dt.second to extract the second (also day, dayofweek, etc.) 
* s.dt.tz_localize('US/Eastern') attaches a timezone to naive datetimes; s.dt.tz_convert(...) converts between timezones
* chaining to convert: s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

* .str - s.str.lower() lower-cases all the strings in that Series. 
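A brief sketch of both accessors. The timestamps and names are made up, and the example assumes the 'US/Eastern' zone is available in the local timezone database.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2021-01-04 09:10:12',
                              '2021-07-10 18:30:45']))

seconds = s.dt.second     # extract a datetime component
weekday = s.dt.dayofweek  # Monday is 0

# mark the naive timestamps as UTC, then convert to US/Eastern
eastern = s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

names = pd.Series(['Alice', 'BOB'])
lower = names.str.lower()
```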

#### Sort
* .sort_index() sorts by labels; options: ascending=False, axis=1 or 0
* .sort_values(by='column') sorts by values 
* can sort by multiple columns: df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])
* argument: na_position='first' puts NaN values first; otherwise they go last. 
* the by argument of .sort_values can mix column names and index level names 
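The sorting options above, sketched on an invented frame `df1`. The index name `key` is an assumption, added so the index can be referenced in `by`.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'one': [2, 1, 1, np.nan],
                    'two': [1, 3, 2, 4],
                    'three': [5, 6, 7, 8]},
                   index=['d', 'b', 'a', 'c'])

by_index = df1.sort_index(ascending=False)  # row labels, descending
by_cols = df1.sort_index(axis=1)            # column labels
by_one = df1.sort_values(by='one')          # NaN goes last by default
multi = df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])
na_first = df1.sort_values(by='one', na_position='first')

# mix a column name with a named index level
mixed = df1.rename_axis('key').sort_values(by=['one', 'key'])
```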

------------------------

## SQL environments 

* often tied to a specific database system or its admin tool: MySQL, pgAdmin (PostgreSQL), SQLite
* IDEs that work: Eclipse, DataGrip, SQL Workbench/J, VS Code

### Connecting to Databases
* ODBC or JDBC 
* DBMS = Database Management Systems

#### __ODBC__
* open database connectivity 
* defines and performs the work to be done
* access database without needing to recompile
* allow applications to use SQL to access DBMS and data easily
* DBMS connects to database
* ODBC driver - processes ODBC function calls
    * converts SQL syntax to be read by the DBMS
* inserts a "client driver" - converts app queries to something the DBMS understands. 

#### __JDBC__
* Java Database Connectivity 
* allows Java apps to connect to a relational database, e.g. MySQL 
* advantage: provides access to various databases without needing to develop code for each database; you build your own SQL statements against one standard API.
* supports ANSI SQL 2003
* supports large number of databases
* Driver - provides the link from the Java app to the database, converts the calls for the database
* Driver Manager - connects the app to a driver based on the connection string; drivers are auto-loaded from the JDBC classpath
    * auto-loading applies to JDBC 4.0 and later; legacy 3.0 requires loading the driver class explicitly (Class.forName)
* API - java.sql and javax.sql
<br> 
* Development Process:
    * connect to db -> create statement object -> execute SQL query -> process result set
<br>
* Getting a connection:
    * format: jdbc:<driver protocol>:<driver connection details>
    * e.g. jdbc:odbc:DemoDSN or jdbc:mysql://localhost:3306/demodb 
* Create a statement object:
    * Statement myStmt = myConn.createStatement(); 
    * used later to execute the SQL query
* Execute SQL query
    * ResultSet myRs = myStmt.executeQuery("select * from employees");
    * could pass a string that was built beforehand
* Process the result set
    * ResultSet.next() 
        * moves forward one row; returns true if there is a row to process, false when there are no more
    * to loop:
        * while (myRs.next()) { // read data from each row }
    * can use 'get' methods for reading data
        * getXxx(columnName) or getXxx(columnIndex)
        * e.g. System.out.println(myRs.getString("first_name")); System.out.println(myRs.getString("last_name"));
        * prints the values of those columns for the current row
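
JDBC itself is a Java API, so the following is only a rough analogue in Python, matching the language of the rest of these notes. It walks through the same four steps (connect, create a statement-like object, execute, process the results) using the standard-library sqlite3 module and an in-memory database. The employees table and its two rows are made up for illustration.

```python
import sqlite3

# 1. connect to the database (in-memory SQLite, not a JDBC URL)
conn = sqlite3.connect(':memory:')

# 2. create a cursor, the closest counterpart to a JDBC Statement
cur = conn.cursor()

# set up illustrative data so the query has something to return
cur.execute("create table employees (first_name text, last_name text)")
cur.executemany("insert into employees values (?, ?)",
                [("Ada", "Lovelace"), ("Alan", "Turing")])

# 3. execute the SQL query
cur.execute("select * from employees")

# 4. process the result set one row at a time
names = []
for first_name, last_name in cur:
    names.append(f"{first_name} {last_name}")

conn.close()
```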