# Reading and Writing files using Pandas

Pandas is one of the most popular Python libraries. It provides a user-friendly interface for reading, displaying and writing files, along with additional features such as plotting, time series analysis and missing-value handling.

In [1]:
import pandas as pd

# Part 1: reading a .csv file

In [2]:
data_csv = pd.read_csv("titanic.csv")

In [4]:
data_csv.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Part 2: reading .txt files

CSV stands for Comma-Separated Values: the values in a .csv file are separated by commas. Plain .txt files have no fixed delimiter. Their values are commonly separated by tabs ("\t"), in which case the file is often called a tab-separated file, or by spaces (" "). To read .txt files in pandas we again use the same **read_csv()** function, but this time we pass another argument besides the name of the file: the separator, **sep**, which must match the delimiter actually used in the file.

In [6]:
data_txt = pd.read_csv("imagine_lyrics.txt", sep=" ")

In [7]:
data_txt.head()

Unnamed: 0,Imagine,by,John,LennonImagine,all,the,"people,",Unnamed: 7
0,living,life,in,peace...,,,,
1,\tJohn,Lennon,,,,,,
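
The lyrics file is free-form text rather than a table, so splitting it on a separator gives ragged columns. As a minimal sketch of how **sep** behaves on tabular text, the snippet below parses a small tab-separated string instead of a file on disk. The sample data and the names `tab_text` and `people` are made up for illustration.

```python
import io

import pandas as pd

# Made-up tab-separated text standing in for a .txt file on disk.
tab_text = "name\tage\tcity\nAnna\t31\tYerevan\nDavit\t27\tGyumri\n"

# sep="\t" tells read_csv to split each line on tab characters.
people = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(people.shape)          # (number of rows, number of columns)
print(list(people.columns))  # column names taken from the first line
```

Passing a file name instead of the `io.StringIO` object should work the same way, as long as the separator matches the file.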


# Part 3: reading .html files

Pandas also has a **read_html()** function, similar to **read_csv()**, which reads HTML files. Both functions can read files directly from a URL. Let's use the URL of careercenter.am to read the page content provided in HTML.

In [9]:
data_html = pd.read_html("https://careercenter.am/")

ValueError: No tables found

As you can see, we receive an error here. The problem is that the **read_html()** function reads only HTML tables, and no table could be found on the careercenter.am home page. If you check the source of their website, you will see that the page has no content of its own: the content is generated through another file called **ccidxann.php**. This means we should copy the link to that file and pass it to **read_html()** instead.

In [11]:
data_html = pd.read_html("https://careercenter.am/ccidxann.php")

In [12]:
data_html.head()

AttributeError: 'list' object has no attribute 'head'

Now the **head()** function cannot be used, because **read_html()** returns a list of dataframes rather than a single dataframe. So let's just print it.

In [13]:
print(data_html)

[                     0                                                  1
0    JOB OPPORTUNITIES                                                NaN
1                  NaN                     Chief Accountant / Noyan Tapan
2                  NaN  Leading Loan Specialist of Microcredit Block i...
3                  NaN                Senior Internal Auditor / FINCA UCO
4                  NaN                     Credit Officer / Prometey Bank
5                  NaN  Director / Civic Development and Partnership F...
6                  NaN                  Finance Director / Reso Insurance
7                  NaN  FTTB, ADSL/ VDSL Networks Monitoring Technical...
8                  NaN               Digital Platforms Manager / ArmenTel
9                  NaN                           Consultant/ Seller / TST
10                 NaN      Operations Research Developer / Optym Armenia
11                 NaN  Product Manager / Berlin-Chemie Armenian Repre...
12                 NaN               

We may check the length of the list to understand how many elements it has. Basically, each element will be one separate table.

In [14]:
len(data_html)

4

In [15]:
data_html[0]

Unnamed: 0,0,1
0,JOB OPPORTUNITIES,
1,,Chief Accountant / Noyan Tapan
2,,Leading Loan Specialist of Microcredit Block i...
3,,Senior Internal Auditor / FINCA UCO
4,,Credit Officer / Prometey Bank
5,,Director / Civic Development and Partnership F...
6,,Finance Director / Reso Insurance
7,,"FTTB, ADSL/ VDSL Networks Monitoring Technical..."
8,,Digital Platforms Manager / ArmenTel
9,,Consultant/ Seller / TST


In [16]:
data_html[1]

Unnamed: 0,0,1
0,INTERNSHIPS,
1,,Branch Intern / HSBC Bank Armenia
2,,Contact Center Intern / HSBC Bank Armenia


In [17]:
data_html[2]

Unnamed: 0,0,1
0,TRAININGS,
1,,English Language Courses / Career Center


In [18]:
data_html[3]

Unnamed: 0,0,1
0,COMPETITIONS,
1,,Invitation to Bid - ITB/ARM/01/2017 - Sale of ...
2,,Call for Designing Companies for SMEDA Project...


Let's take only the job postings table. Like all the others, it has 2 columns. Apart from the section title in its first row, the first column has only NaN values, so we will choose only the second one and save it as our data for analysis.

In [19]:
data = data_html[0][1]

Now we have a pandas Series (a single column of the dataframe), which can be used together with the **head()** and other functions.

In [20]:
data.head()

0                                                  NaN
1                       Chief Accountant / Noyan Tapan
2    Leading Loan Specialist of Microcredit Block i...
3                  Senior Internal Auditor / FINCA UCO
4                       Credit Officer / Prometey Bank
Name: 1, dtype: object
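
The steps of this part can be reproduced offline. The following is a sketch only: the HTML string is made up and is not the real careercenter.am page, and the names `html`, `tables`, `jobs` and `postings` are illustrative. It also assumes that an HTML parser which **read_html()** can use (for example lxml) is installed.

```python
import io

import pandas as pd

# Made-up HTML with two header-less tables (not the real page content).
html = """
<table>
  <tr><td>JOB OPPORTUNITIES</td><td></td></tr>
  <tr><td></td><td>Chief Accountant</td></tr>
  <tr><td></td><td>Credit Officer</td></tr>
</table>
<table>
  <tr><td>INTERNSHIPS</td><td></td></tr>
  <tr><td></td><td>Branch Intern</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(io.StringIO(html))
print(type(tables), len(tables))

# Take the first table, then its second column by position.
jobs = tables[0]
postings = jobs.iloc[:, 1].dropna()  # a Series with the missing values dropped

print(type(postings))
print(postings.tolist())
```

Selecting the column by position with `iloc` avoids relying on how the columns happen to be labelled, and `dropna()` removes the missing value left by the title row.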

# Part 4: reading other files

Pandas also has functions for reading Excel, Stata, SAS, JSON, SQL and other files. You may check the [official documentation](http://pandas.pydata.org/pandas-docs/version/0.20/io.html) for details.
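
As one example of these other readers, here is a minimal sketch using **read_json()**. The JSON text and the names `json_text` and `jobs_json` are assumptions made for illustration. A file name or URL could be passed in place of the `io.StringIO` object.

```python
import io

import pandas as pd

# Made-up JSON records standing in for a .json file.
json_text = (
    '[{"title": "Chief Accountant", "company": "Noyan Tapan"},'
    ' {"title": "Credit Officer", "company": "Prometey Bank"}]'
)

# Each JSON object becomes one row and each key becomes a column.
jobs_json = pd.read_json(io.StringIO(json_text))

print(jobs_json.shape)
print(jobs_json)
```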

# Part 5: writing to files

Writing in Pandas is as easy as reading. You just need to use another function, **to_csv()** (in the case of CSV files). Let's take a look at it.

In [21]:
data.to_csv("careercenter_data.csv")

We may now go to our folder to check the csv file.
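
The file can also be checked from Python by reading it back. The sketch below assumes a small made-up dataframe and a temporary folder, so it does not touch the file written above. The names `postings_df`, `path` and `restored` are illustrative. Passing `index=False` to **to_csv()** leaves the row index out of the file, which keeps it from coming back as an extra column.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the scraped job postings.
postings_df = pd.DataFrame(
    {"posting": ["Chief Accountant / Noyan Tapan", "Credit Officer / Prometey Bank"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "postings_example.csv")

    # index=False writes only the data columns, not the row index.
    postings_df.to_csv(path, index=False)

    # Read the file back to confirm what was written.
    restored = pd.read_csv(path)

print(list(restored.columns))
print(restored["posting"].tolist() == postings_df["posting"].tolist())
```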