# 8. Data Import 
While pandas offers many ways to read data into Python, we focus on plain-text table files such as CSV and TSV.
Everything needed to open plain-text files is contained in a single function: pd.read_csv().
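
The same function also covers tab-separated files. The following is a minimal sketch, assuming a small made-up table held in an in-memory buffer rather than a file on disk; the column names and values are illustrative only.

```python
import io
import pandas as pd

# Hypothetical tab-separated content; the buffer stands in for a .tsv file.
tsv_text = "name\tscore\nAda\t90\nGrace\t85\n"

# sep="\t" tells read_csv() that columns are separated by tabs, not commas.
scores = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(scores)
```

With a real file, the path to the .tsv file would take the place of the buffer.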

In [3]:
import pandas as pd

In [4]:
import os

os.getcwd()  # get current working directory (cwd)

'/Users/wangxinyuan/Data Analysis with Python'
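
Knowing the working directory makes it possible to build file paths relative to it instead of typing a full absolute path. A small sketch using the standard library's pathlib; the folder name "data" and the file name are assumptions for illustration.

```python
from pathlib import Path

# "data" and "students.csv" are hypothetical names used for illustration.
data_dir = Path.cwd() / "data"
csv_path = data_dir / "students.csv"

print(csv_path)           # path built from the current working directory
print(csv_path.exists())  # True only if the file is really there
```

A Path object like csv_path can be passed to pd.read_csv() directly.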

In [5]:
import pandas as pd

students = pd.read_csv("/Users/wangxinyuan/Desktop/data/students.csv")
print(students)

   Student ID         Full Name      favourite.food             mealPlan   AGE
0           1    Sunil Huffmann  Strawberry yoghurt           Lunch only     4
1           2      Barclay Lynn        French fries           Lunch only     5
2           3     Jayendra Lyne                 NaN  Breakfast and lunch     7
3           4      Leon Rossini           Anchovies           Lunch only     8
4           5  Chidiegwu Dunkel               Pizza  Breakfast and lunch  five
5           6     Güvenç Attila           Ice cream           Lunch only     6


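The read_csv() function automatically creates a new index, which is just the position of each row, and takes the top line of the file as the header, that is, the column names. If the data do not have column names, you can pass names= with a list to tell read_csv() which column names to use. Note that when names= is given, the top line is no longer treated as a header: in the example below, the original header line of students.csv becomes the first row of data.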

In [6]:
df_1 = pd.read_csv("/Users/wangxinyuan/Desktop/data/students.csv", names=range(5))
df_1

            0                 1                   2                    3     4
0  Student ID         Full Name      favourite.food             mealPlan   AGE
1           1    Sunil Huffmann  Strawberry yoghurt           Lunch only     4
2           2      Barclay Lynn        French fries           Lunch only     5
3           3     Jayendra Lyne                 NaN  Breakfast and lunch     7
4           4      Leon Rossini           Anchovies           Lunch only     8
5           5  Chidiegwu Dunkel               Pizza  Breakfast and lunch  five
6           6     Güvenç Attila           Ice cream           Lunch only     6
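
When a file already has a header line and you want to replace it rather than keep it as data, names= can be combined with header=0. The sketch below uses a small in-memory stand-in for the file; the replacement names "id", "name" and "age" are chosen only for illustration.

```python
import io
import pandas as pd

# A small stand-in for a CSV file that already has a header line.
csv_text = "Student ID,Full Name,AGE\n1,Sunil Huffmann,4\n2,Barclay Lynn,5\n"

# names= alone: the header line is kept as the first row of data.
kept = pd.read_csv(io.StringIO(csv_text), names=["id", "name", "age"])

# names= with header=0: the header line is discarded and replaced.
replaced = pd.read_csv(io.StringIO(csv_text), names=["id", "name", "age"], header=0)

print(len(kept), len(replaced))  # 3 2
```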


You may wish to change which column is used as the index. To do this, use the index_col argument, for example:

In [7]:
df_2 = pd.read_csv("/Users/wangxinyuan/Desktop/data/students.csv", index_col=0)
df_2

                   Full Name      favourite.food             mealPlan   AGE
Student ID
1             Sunil Huffmann  Strawberry yoghurt           Lunch only     4
2               Barclay Lynn        French fries           Lunch only     5
3              Jayendra Lyne                 NaN  Breakfast and lunch     7
4               Leon Rossini           Anchovies           Lunch only     8
5           Chidiegwu Dunkel               Pizza  Breakfast and lunch  five
6              Güvenç Attila           Ice cream           Lunch only     6
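
The index column can also be given by name, which then allows rows to be looked up by that value. A minimal sketch, assuming a two-row in-memory excerpt of the students table:

```python
import io
import pandas as pd

# Two-row stand-in for the students file.
csv_text = "Student ID,Full Name,AGE\n1,Sunil Huffmann,4\n3,Jayendra Lyne,7\n"

# index_col accepts a column name as well as a position.
by_id = pd.read_csv(io.StringIO(csv_text), index_col="Student ID")

# .loc now looks rows up by Student ID rather than by row position.
print(by_id.loc[3, "Full Name"])  # Jayendra Lyne
```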


## 8.2.2 First Steps 
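The column names are not formatted consistently: they mix spaces, dots, camel case, and upper case. It is worth renaming them to a standard style. The clean_columns() function from the skimpy package does this, converting them to lower-case names joined by underscores.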

In [8]:
from skimpy import clean_columns

students = clean_columns(students)
students

   student_id         full_name      favourite_food            meal_plan   age
0           1    Sunil Huffmann  Strawberry yoghurt           Lunch only     4
1           2      Barclay Lynn        French fries           Lunch only     5
2           3     Jayendra Lyne                 NaN  Breakfast and lunch     7
3           4      Leon Rossini           Anchovies           Lunch only     8
4           5  Chidiegwu Dunkel               Pizza  Breakfast and lunch  five
5           6     Güvenç Attila           Ice cream           Lunch only     6
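
If skimpy is not available, a similar renaming can be approximated with pandas and the standard library. The helper to_snake_case below is hypothetical and is not skimpy's implementation; it only handles the kinds of names seen in this table.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: convert one column name to snake_case."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)  # mealPlan -> meal_Plan
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)           # spaces and dots -> underscores
    return name.strip("_").lower()

# An empty frame with the original column names is enough to show the renaming.
df = pd.DataFrame(columns=["Student ID", "Full Name", "favourite.food", "mealPlan", "AGE"])
df = df.rename(columns=to_snake_case)
print(list(df.columns))
```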


Another task after reading in data is to consider variable types. In the favourite_food column, there are some food items and then the value NaN, which has been read in as a floating point number rather than a missing string. We can solve this by casting that column to explicitly be composed of strings: 

In [9]:
students["favourite_food"] = students["favourite_food"].astype("string")
students

   student_id         full_name      favourite_food            meal_plan   age
0           1    Sunil Huffmann  Strawberry yoghurt           Lunch only     4
1           2      Barclay Lynn        French fries           Lunch only     5
2           3     Jayendra Lyne                <NA>  Breakfast and lunch     7
3           4      Leon Rossini           Anchovies           Lunch only     8
4           5  Chidiegwu Dunkel               Pizza  Breakfast and lunch  five
5           6     Güvenç Attila           Ice cream           Lunch only     6
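
The effect of the cast on the missing value can be checked on a small series. This is a sketch with made-up values, assuming the column starts out as object with a floating point NaN, as described above.

```python
import pandas as pd

# Hypothetical column with one missing entry, stored as a float NaN.
food = pd.Series(["Pizza", float("nan"), "Ice cream"], dtype="object")
print(type(food[1]))  # the missing entry is a float

# After the cast, the column has a dedicated string type.
food = food.astype("string")
print(food.dtype)
print(food.isna().sum())  # the entry is still counted as missing
```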


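Similarly, the "age" column should hold integers, but it contains the text value "five", so pandas has read the whole column in as the object type. We can replace the offending value with a number: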

In [10]:
students["age"] = students["age"].replace("five", 5)
students["age"]

0    4
1    5
2    7
3    8
4    5
5    6
Name: age, dtype: object
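
In a larger table the non-numeric entries may not be obvious by eye. One possible way to find them, sketched here on a copy of the age values, is pd.to_numeric() with errors="coerce", which turns anything it cannot parse into a missing value.

```python
import pandas as pd

# The age values as text, as they arrive from the CSV file.
age = pd.Series(["4", "5", "7", "8", "five", "6"], name="age")

# Entries that cannot be parsed as numbers become NaN.
as_numbers = pd.to_numeric(age, errors="coerce")
not_numeric = age[as_numbers.isna()]
print(not_numeric.tolist())  # ['five']

# Once the offending value is replaced, the column converts cleanly.
fixed = age.replace({"five": "5"}).astype("int")
print(fixed.tolist())
```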

In a moment, we will turn this into a column of integers too. 

Another example where the data type is wrong is meal_plan. This is a categorical variable with a known set of possible values. pandas has a special data type for these:

In [11]:
students["meal_plan"] = students["meal_plan"].astype("category")
students["meal_plan"]

0             Lunch only
1             Lunch only
2    Breakfast and lunch
3             Lunch only
4    Breakfast and lunch
5             Lunch only
Name: meal_plan, dtype: category
Categories (2, object): ['Breakfast and lunch', 'Lunch only']

Note that the values in the meal_plan variable have stayed exactly the same, but the type of the variable has changed from object to category.

It is a bit tedious to go through the columns one by one, applying a type with a single-line assignment each time. An alternative is to pass a dictionary that maps columns to types, as follows:

In [12]:
students = students.astype({"student_id": "int", "full_name": "string", "age": "int"})
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   student_id      6 non-null      int64   
 1   full_name       6 non-null      string  
 2   favourite_food  5 non-null      string  
 3   meal_plan       6 non-null      category
 4   age             6 non-null      int64   
dtypes: category(1), int64(2), string(2)
memory usage: 454.0 bytes
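
A dictionary of the same kind can also be given to read_csv() through its dtype argument, so that the types are set while the file is read. A minimal sketch, assuming a two-row in-memory excerpt with already cleaned column names:

```python
import io
import pandas as pd

# Two-row stand-in for a cleaned students file.
csv_text = (
    "student_id,full_name,meal_plan\n"
    "1,Sunil Huffmann,Lunch only\n"
    "3,Jayendra Lyne,Breakfast and lunch\n"
)

# The column-to-type dictionary is applied as the data are read in.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"student_id": "int64", "full_name": "string", "meal_plan": "category"},
)
print(typed.dtypes)
```

This only works when every value in a column can be converted, so a column like the original age, with "five" in it, would still need fixing first.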


# 8.3. Reading data from multiple files 
Sometimes your data is split across multiple files rather than contained in a single file. With pd.read_csv() you can read the files in one by one and then stack them on top of each other in a single data frame using the pd.concat() function.


In [13]:
# Template only: the three file paths are placeholders, so the code is left commented out.
# list_of_dataframes = [
#     pd.read_csv(x)
#     for x in ["file_path1", "file_path2", "file_path3"]
# ]
# sales_files = pd.concat(list_of_dataframes)
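
A runnable version of the same pattern is sketched below, with two in-memory buffers standing in for files on disk; the sales figures are made up for illustration.

```python
import io
import pandas as pd

# Two in-memory buffers stand in for separate CSV files (hypothetical data).
january = io.StringIO("month,units\nJan,10\nJan,12\n")
february = io.StringIO("month,units\nFeb,7\n")

frames = [pd.read_csv(f) for f in (january, february)]

# ignore_index=True renumbers the rows 0..n-1 instead of
# repeating each file's own row numbers.
sales = pd.concat(frames, ignore_index=True)
print(sales)
```

Without ignore_index=True the combined frame would keep the row numbers 0, 1, 0, which makes position-based lookups ambiguous.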

# 8.4. Writing to a file 


In [14]:
students.to_csv("/Users/wangxinyuan/Desktop/data/students-clean.csv")

In [15]:
# Let's read it back in and check the info on data types
pd.read_csv("/Users/wangxinyuan/Desktop/data/students-clean.csv").info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      6 non-null      int64 
 1   student_id      6 non-null      int64 
 2   full_name       6 non-null      object
 3   favourite_food  5 non-null      object
 4   meal_plan       6 non-null      object
 5   age             6 non-null      int64 
dtypes: int64(3), object(3)
memory usage: 420.0+ bytes
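
The extra "Unnamed: 0" column in this output is the old row index, which to_csv() writes to the file by default. A minimal sketch of how index=False avoids it, using in-memory buffers and a made-up two-row frame instead of files on disk:

```python
import io
import pandas as pd

df = pd.DataFrame({"student_id": [1, 2], "age": [4, 5]})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the row index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```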


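We lost a lot of the nice data work we did! CSV is a plain-text format and does not store data types, so the string and category columns have come back as object. If you want to save data in a file and have it remember the data types, you'll need to use a different data format, such as feather: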

In [16]:
students.to_feather("/Users/wangxinyuan/Desktop/data/students-clean.feather")

In [18]:
pd.read_feather("/Users/wangxinyuan/Desktop/data/students-clean.feather").info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   student_id      6 non-null      int64   
 1   full_name       6 non-null      string  
 2   favourite_food  5 non-null      string  
 3   meal_plan       6 non-null      category
 4   age             6 non-null      int64   
dtypes: category(1), int64(2), string(2)
memory usage: 454.0 bytes
