<a href="https://colab.research.google.com/github/Catherine-Elkin/How.To.Be.A.Data.Scientist/blob/main/day_one.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Day One: Introduction to Python and Python for Data Analysis
Python is an extremely useful and powerful programming language, which has allowed it to become one of the languages most used by developers. One of the main reasons for this is that it is easier to understand and learn than many other languages. There are also several library packages that can be added, which are particularly useful for data analysis.

## Basics of Python 
Before we look at analysing data with Python, we first need to be introduced to some of the basics. Remember - these are only a small number of the things Python has to offer!

- Variable 
- Data Types 
- Data structures (List and Dictionary)
- Methods 
- Importing Modules

## Variable 
A variable is the most basic thing in programming and is used in all programs. A variable stores a value that can be called and changed while the program is running. When we assign data to a variable we use =, which you could read as 'is'. 
e.g.
- x = 10
- x  **is** 10

In [None]:
x = 10 
print(x)
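
As a small illustrative sketch (the numbers are just examples), the cell below shows how the value stored in a variable can be changed while the program is running:

```python
# Store a value in the variable x
x = 10

# Take the current value of x, add 5, and store the result back in x
x = x + 5

print(x)  # prints 15
```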

## Data Types 
In nearly all programming languages there are different types of data, such as:
- String 
- Integer
- Boolean 
- Float (Decimal)

Python is a dynamically typed language, which means that unlike some other languages (like Java or C++) we do not have to state what the variable type will be. Instead, Python works out the data type from the data that you give it. Even though Python works this out, we still need to know it too!

In [None]:
#Example: storing a university student's data
#String 
name = "James Taylor"

#Int 
age = 21

#Float
average_score = 0.8

#Boolean
placement_year = True 

You can check data types using the type() function, which can be useful when evaluating subsets of data.

In [None]:
#Shows the types
type(name) , type(age), type(average_score), type(placement_year)
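
The sketch below, using made-up values, illustrates how Python works out the type from the value itself, and how the built-in int() function can convert a string of digits into a number:

```python
# Python works out the type from the value that is assigned
age = 21
print(type(age))   # <class 'int'>

# The same variable name can later hold a value of a different type
age = "21"
print(type(age))   # <class 'str'>

# Built-in functions such as int() and float() convert between types
age_as_number = int(age)
print(age_as_number + 1)   # prints 22
```

Checking the type first is a quick way to find out why a calculation fails, for example when a number has been stored as text.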

## Data Structures 
There are also some key data structures that are useful in data analysis, as they can hold multiple pieces of data:
- Lists 
- Dictionaries  

### List 
A list is a data structure that stores multiple values in order and lets you access each value by its position. 

We tell Python we want to use a list using square brackets [].

In [None]:
#We declare an empty list 
#(it is named numbers rather than list, so that Python's built-in list type is not overwritten)
numbers = []

#We can also declare a list that already contains values
numbers = [1,2,3,4,5,6,7,8,9]

When accessing something in a list, we must give its position within the list (referred to as the **index**).

In programming, we start the index of a list at 0. This is called **zero-based indexing**. If you try to access an index that does not exist in the list, Python raises an IndexError and the code stops running.


In [None]:
#Access and store the first value of the list in a variable 
item_one = numbers[0]

#We can also change the values of the list 
numbers[0] = 100
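
To make zero-based indexing concrete, here is a minimal sketch using a made-up list of three scores. It also shows the error Python raises when an index does not exist, caught with try/except so the cell still finishes running:

```python
# A made-up list of three scores, used only for illustration
scores = [70, 85, 90]

print(scores[0])     # first item, prints 70
print(scores[2])     # third (last) item, prints 90
print(len(scores))   # number of items, prints 3

# Index 3 does not exist (the valid indexes are 0, 1 and 2),
# so Python raises an IndexError
try:
    print(scores[3])
except IndexError as error:
    message = str(error)
    print("IndexError:", message)
```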

### Dictionaries 
Sometimes we may want to retrieve data without having to remember the index of the list - this is where dictionaries come in. 

Each entry in a dictionary has two parts - the **key** and the **value**. We look up a value by giving its key. We tell Python we want to use a dictionary by using curly brackets {}.

In [None]:
students = {
    'name':'James Taylor',
    'age':21,
    "placement_year":True
}

students

In [None]:
#We can change the values linked to the key with this:
students['name'] = 'John Taylor'

students

In [None]:
#We can also have a list stored in the values section
students = {
    'name':['James Taylor', 'Georgia Smith'],
    'age':[21, 24],
    "placement_year":[True, False]
}

students
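
The following sketch shows a few ways of reading values back out of a dictionary like this one. The 'course' key and the 'unknown' default are hypothetical and only there to show what happens when a key is missing:

```python
students = {
    'name': ['James Taylor', 'Georgia Smith'],
    'age': [21, 24],
    'placement_year': [True, False]
}

# Look up the list stored under the key 'name'
names = students['name']
print(names)

# Combine a key with a list index to get a single value
second_student_age = students['age'][1]
print(second_student_age)   # prints 24

# .get() returns a default value instead of raising a KeyError
# when the key is not in the dictionary
course = students.get('course', 'unknown')
print(course)   # prints unknown
```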

## Methods and Importing Modules 
These are the final things that you need to know about Python to be able to take on the rest of this data analysis course. 

### Methods
Think of a method as a wrapper that you put around some code. You can then call it by name to run that code, as many times as you need. Programmers use methods to make writing code easier and more efficient. You can write your own methods, but Python also comes with some premade ones, which means we don't have to do as much work! (In Python, a standalone method like this is usually called a **function**.)

All the things we are going to use in the **pandas** library are methods that have already been made for us to use - we are simply calling them.

In [None]:
#This is an example of a method that we could write ourselves
def hello(name):
    print(f'Hello {name}')

hello('Charlie')

In [None]:
#The print() command is a method that has been created by someone else that we can use
print('This is an example of a method made by someone else ')
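
Methods can also hand a result back to the code that called them by using return. The sketch below is one possible example; the name average and the scores are made up for illustration:

```python
# A method that takes a list of numbers and returns their average
def average(values):
    return sum(values) / len(values)

# The returned value can be stored in a variable and reused
result = average([60, 70, 80])
print(result)   # prints 70.0
```

Returning a value, rather than only printing it, means the result can be used later in the program.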

### Importing Modules 

Importing modules (**libraries**) is one of the most important features of a good programming language, as they allow the language to take on different tasks. 

For example, without importing the pandas library, we would not be able to carry out as much data analysis. Importing libraries simply means we can access the code. 

NOTE: We would usually have to install libraries onto our devices to be able to run them, but Anaconda and Google Colab come with the libraries we will need pre-installed!

In [None]:
#An example of importing the random module and giving it the alias r 
import random as r 

#Using the randint method, which returns a random whole number from the range given (both ends included)
print(r.randint(0,10))
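
There is more than one way to write an import. As a brief sketch, the cell below uses two other modules from Python's standard library (math and statistics, chosen only as examples) to show importing a whole module and importing a single name from a module:

```python
# Import the whole module and access its contents through the module name
import math
print(math.sqrt(16))   # prints 4.0

# Import just one name from a module so it can be used directly
from statistics import mean
print(mean([21, 24, 30]))   # prints 25
```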

# Basics of Data Analysis: Importing the Data and Useful Methods
For data manipulation and analysis, the main modules that are needed are the **pandas** and **numpy** libraries. In this notebook, the pandas library will be used to import and manipulate the data. Because the library is so large, there are lots of different methods and functions that can be used. 

To find out more about this library, you can look at the documentation: https://pandas.pydata.org/docs/user_guide/index.html

In [None]:
#The module is imported and given the alias -  pd 
import pandas as pd 

#The CSV file is read from a local folder and stored in the variable data
#To run this locally, make sure the data is in the correct place
data = pd.read_csv('Data/titanic.csv')

#Alternatively, uncomment the line below to pull the data from an online GitHub repository
#This can only be run when connected to the internet
#data = pd.read_csv('https://raw.githubusercontent.com/chroadhouse/Futureme/main/Data/titanic.csv')
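
If the file is not to hand, the same read_csv call can be tried on a few lines of CSV text held in memory. This is only a sketch: the rows and column names below are made up and are not taken from the Titanic data set.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line holds the column names
csv_text = (
    "name,age,placement_year\n"
    "James Taylor,21,True\n"
    "Georgia Smith,24,False\n"
)

# io.StringIO lets read_csv treat the text as if it were a file
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # prints (2, 3): two rows and three columns
print(sample)
```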

# Initial Look at the Data 
When importing data, it's good to do some initial analysis to familiarise yourself: 

* **.head()** - shows a small amount of the data

* **.columns** - shows the names of the columns in the table 

* **.describe()** - displays summary statistics for the numerical columns, such as the count, mean, minimum and maximum (these values are displayed but not stored unless you assign the result to a variable)

* **.info()** - gives you a summary of the data set, including the column names, the number of non-missing values and the data types

* **.dtypes** - shows the data type of each column (this is an attribute, so it is written without brackets)

* **.shape** - gives you the shape of the data set in terms of rows and columns 

In [None]:
#This is usually run to ensure that the data set has imported correctly into the notebook
data.head()

In [None]:
#Prints the columns of the dataset
data.columns

In [None]:
#Gives you the shape of the dataset in (rows,columns)
data.shape

In [None]:
#Gives back the info of each column
data.info()

In [None]:
#Returns the stats for all numerical columns
data.describe()

In [None]:
#Gives you the data types for each column 
data.dtypes
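
The same checks can be tried on a small table built by hand, where it is easy to confirm the answers by eye. This is a sketch using made-up values rather than the Titanic data. It also shows that the output of describe() can be stored in a variable if the statistics are needed later:

```python
import pandas as pd

# A small, made-up table with three rows and two numerical columns
df = pd.DataFrame({
    'age': [21, 24, 30],
    'score': [0.8, 0.6, 0.7]
})

print(df.shape)           # prints (3, 2): three rows and two columns
print(list(df.columns))   # prints ['age', 'score']
print(df.dtypes)          # the data type of each column

# describe() returns a table of statistics, which can be stored and reused
summary = df.describe()
mean_age = summary.loc['mean', 'age']
print(mean_age)           # prints 25.0
```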