#Text file processing in Python

There are two main groups of files stored on a computer: text files and binary files. Text files are human readable and are usually edited with a text editor such as Notepad or Notepad++ (for example, .txt, .csv, .html and .xml files are text files). Binary files are created and read by special programs (for example .jpg, .exe, .las, .doc, .xls).
Text files consist of lines; the lines are separated by end-of-line (EOL) markers.

Operating system         | EOL marker
-------------------------|------------
Windows                  | \r\n
Linux/Unix, macOS (OS X) | \n
Classic Mac OS (up to 9) | \r
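
Python can hide most of these differences. The following is a minimal sketch with an invented sample string: `str.splitlines()` accepts all three markers, and `open()` in text mode translates them to `\n` by default when reading.

```python
# A sample text mixing the three EOL markers (invented for illustration)
text = "first\r\nsecond\nthird\rfourth"

# splitlines() recognises \r\n, \n and \r and drops the markers
lines = text.splitlines()
print(lines)  # ['first', 'second', 'third', 'fourth']
```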

##Types of text files

Have you seen files like the following?

---
CSV file with a header line, comma separated, fixed record structure
```
Psz,X,Y,Z,
11,91515.440,2815.220,111.920
12,90661.580,1475.280,
13,84862.540,3865.360,
14,91164.160,4415.080,130.000
15,86808.180,347.660,
16,90050.240,3525.120,
231,88568.240,2281.760,
232,88619.860,3159.880,
5001,,,100.000
5002,,,138.800
...
```
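
A file like this can be read with the standard `csv` module. The sketch below is only one possible approach; it uses a few records copied from the listing above, and the helper name `to_float` is made up for this example. Empty fields are turned into `None`.

```python
import csv
import io

# Sample records copied from the listing above;
# in practice use open('file_name') instead of io.StringIO
sample = """Psz,X,Y,Z,
11,91515.440,2815.220,111.920
12,90661.580,1475.280,
5001,,,100.000
"""

def to_float(text):
    """Convert a field to float; an empty field becomes None (hypothetical helper)."""
    return float(text) if text.strip() else None

points = []
reader = csv.reader(io.StringIO(sample))
header = next(reader)             # skip the header line
for row in reader:
    psz, x, y, z = row[:4]        # ignore anything after the 4th field
    points.append((psz, to_float(x), to_float(y), to_float(z)))

for p in points:
    print(p)
```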

---
Stanford Triangle Format (Polygon File Format, PLY) for point clouds and meshes, several header lines, space separated records with fixed structure
```
ply
format ascii 1.0
element vertex 1978561
property float x
property float y
property float z
property float nx
property float ny
property float nz
property uchar diffuse_red
property uchar diffuse_green
property uchar diffuse_blue
end_header
0.445606 -10.6263 16.0626 -0.109425 -0.0562636 -0.992401 63 68 83
0.460964 -10.6142 16.0604 -0.255715 -0.00303709 -0.966747 43 52 72
0.434582 -10.4337 16.0433 -0.252035 0.171206 -0.952453 32 36 44
0.449782 -10.3186 16.0506 -0.175198 -0.0186472 -0.984357 40 42 53
...
```
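
The header of such a file tells how many vertex records follow and what each column means. A minimal sketch of reading it, assuming an ASCII file whose only element is `vertex` (the function name `read_ply_vertices` and the shortened sample are made up for this example):

```python
import io

# Shortened sample based on the listing above (only 2 vertices, 3 properties)
sample = """ply
format ascii 1.0
element vertex 2
property float x
property float y
property float z
end_header
0.445606 -10.6263 16.0626
0.460964 -10.6142 16.0604
"""

def read_ply_vertices(fp):
    """Return (property names, vertex records) from an ASCII PLY stream."""
    n_vertex = 0
    names = []
    for line in fp:
        words = line.split()
        if not words:
            continue                      # skip empty lines
        if words[:2] == ['element', 'vertex']:
            n_vertex = int(words[2])      # number of vertex records
        elif words[0] == 'property':
            names.append(words[-1])       # last word is the property name
        elif words[0] == 'end_header':
            break
    # the header is followed by n_vertex space separated records
    vertices = [[float(v) for v in next(fp).split()] for _ in range(n_vertex)]
    return names, vertices

names, vertices = read_ply_vertices(io.StringIO(sample))
print(names)
print(vertices)
```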
---
ESRI ASCII GRID format, six header lines, space separated, fixed record structure
```
ncols 11
nrows 9
xllcorner 576540
yllcorner 188820
cellsize 30
nodata_value -9999
-9999 -9999 139.37 139.81 140.77 141.97 143.32 144.16 -9999
-9999 137.29 137.61 138.00 138.93 140.02 141.40 141.60 140.81
-9999 135.78 135.69 135.89 137.04 138.25 139.44 139.76 139.19
133.94 134.15 133.98 134.03 135.28 136.79 137.69 137.92 137.87
132.76 132.77 132.99 132.58 133.76 135.16 135.73 135.77 135.80
131.76 131.53 131.64 130.81 132.26 133.44 133.85 133.93 -9999
-9999 -9999 130.75 130.15 130.52 132.00 132.46 -9999 -9999
...
```
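
The six header lines are keyword/value pairs, so they fit naturally into a dictionary. A minimal sketch, assuming exactly six header lines as in the listing above; the small grid and the function name `read_ascii_grid` are invented for this example:

```python
import io

# Small invented grid in the same layout as the listing above
sample = """ncols 3
nrows 2
xllcorner 576540
yllcorner 188820
cellsize 30
nodata_value -9999
-9999 139.37 139.81
133.94 134.15 -9999
"""

def read_ascii_grid(fp):
    """Return (header dict, rows) from an ESRI ASCII GRID stream."""
    header = {}
    for _ in range(6):                    # six 'keyword value' header lines
        key, value = fp.readline().split()
        header[key.lower()] = float(value)
    nodata = header['nodata_value']
    rows = []
    for line in fp:
        if line.strip():                  # skip empty lines
            # replace the no-data marker with None
            rows.append([None if float(v) == nodata else float(v)
                         for v in line.split()])
    return header, rows

header, rows = read_ascii_grid(io.StringIO(sample))
print(header)
print(rows)
```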

---
Leica GSI file, fixed field width, space separated

```
*110001+0000000000002014 81..10+0000000000663190 82..10+0000000000288540 83..10-0000000000001377
*110002+0000000000002015 81..10+0000000000649270 82..10+0000000000319760 83..10-0000000000000995
*110003+0000000000002019 81..10+0000000000593840 82..10+0000000000253050 83..10-0000000000001486
*110004+0000000000002020 81..10+0000000000562890 82..10+0000000000274730 83..10-0000000000001309
*110005+00000000000000AE 81..10+0000000000664645 82..10+0000000000245619 83..10+0000000000001505
*110006+00000000000000EL 81..10+0000000000714787 82..10+0000000000300190 83..10+0000000000002396
*110007+00000000000000HK 81..10+0000000000633941 82..10+0000000000269764 83..10+0000000000000362
...
```
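
Each word in the listing above starts with a two-character word index, followed by four information characters, a sign and 16 data characters. The sketch below splits a record along these lines. It rests on assumptions that should be checked against the instrument documentation: word index 11 is taken as the point identifier, and the numeric words (81, 82, 83) are taken to be stored in millimetres. The function name `parse_gsi_line` is made up for this example.

```python
import re

# First record of the listing above
line = ("*110001+0000000000002014 81..10+0000000000663190 "
        "82..10+0000000000288540 83..10-0000000000001377")

# word index (2 chars), info (4 chars), sign, data (16 chars)
WORD = re.compile(r'(\d\d)(.{4})([+-])(\S{16})')

def parse_gsi_line(line):
    """Return a dict of word index -> value for one record."""
    record = {}
    for wi, info, sign, data in WORD.findall(line.lstrip('*')):
        if wi == '11':
            # ASSUMPTION: word 11 is the point identifier, kept as text
            record[wi] = data.lstrip('0')
        else:
            # ASSUMPTION: numeric words are stored in millimetres
            value = int(data) / 1000.0
            record[wi] = -value if sign == '-' else value
    return record

record = parse_gsi_line(line)
print(record)
```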

---
GeoJSON file, free format, label - value pairs, vectors, hierarchical structure
```
{ "type": "FeatureCollection",
"features": [
{ "type": "Feature",
"geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
"properties": {"prop0": "value0"}
},
{ "type": "Feature",
"geometry": {
"type": "LineString",
"coordinates": [
[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
]
},
"properties": {
"prop0": "value0",
"prop1": 0.0
...
```
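
Since GeoJSON is plain JSON, the standard `json` module turns it into nested Python dictionaries and lists. A minimal sketch, using a shortened but complete version of the listing above:

```python
import json

# Complete, shortened version of the listing above
text = """
{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": "value0"}
    },
    { "type": "Feature",
      "geometry": {"type": "LineString",
                   "coordinates": [[102.0, 0.0], [103.0, 1.0]]},
      "properties": {"prop0": "value0", "prop1": 0.0}
    }
  ]
}
"""

collection = json.loads(text)

# walk down the hierarchy: collection -> features -> geometry/properties
for feature in collection['features']:
    geom = feature['geometry']
    print(geom['type'], geom['coordinates'], feature['properties'])

geom_types = [f['geometry']['type'] for f in collection['features']]
```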

---
GML (XML), free format, hierarchical structure, tags, international standard
```
<?xml version="1.0" encoding="utf-8" ?>
<ogr:FeatureCollection
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://ogr.maptools.org/ xxx.xsd"
     xmlns:ogr="http://ogr.maptools.org/"
     xmlns:gml="http://www.opengis.net/gml">
  <gml:boundedBy>
    <gml:Box>
      <gml:coord><gml:X>632897.91</gml:X><gml:Y>134104.66</gml:Y></gml:coord>
      <gml:coord><gml:X>636129.8</gml:X><gml:Y>138914.58</gml:Y></gml:coord>
    </gml:Box>
  </gml:boundedBy>
  <gml:featureMember>
    <ogr:xxx fid="xxx.0">
      <ogr:geometryProperty><gml:Point srsName="EPSG:23700"><gml:coordinates>635474.17,137527.75</gml:coordinates></gml:Point></ogr:geometryProperty>
...
```
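
XML files can be processed with the standard `xml.etree.ElementTree` module; the namespace prefixes have to be mapped to their URIs when searching. A minimal sketch on a shortened, well-formed sample based on the listing above (the closing tags are added here to complete the truncated listing):

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample based on the listing above
text = """<?xml version="1.0" encoding="utf-8" ?>
<ogr:FeatureCollection
     xmlns:ogr="http://ogr.maptools.org/"
     xmlns:gml="http://www.opengis.net/gml">
  <gml:featureMember>
    <ogr:xxx fid="xxx.0">
      <ogr:geometryProperty><gml:Point srsName="EPSG:23700"><gml:coordinates>635474.17,137527.75</gml:coordinates></gml:Point></ogr:geometryProperty>
    </ogr:xxx>
  </gml:featureMember>
</ogr:FeatureCollection>
"""

# map the prefix used in the search paths to the namespace URI
ns = {'gml': 'http://www.opengis.net/gml'}

root = ET.fromstring(text.encode('utf-8'))
points = []
for point in root.iterfind('.//gml:Point', ns):
    east, north = point.find('gml:coordinates', ns).text.split(',')
    points.append((point.get('srsName'), float(east), float(north)))
print(points)
```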

##Processing patterns

Automated processing of text files uses a command line interface and command line parameters. There is no need for a GUI (Graphical User Interface), as there is no user to communicate with.

**Redirection of standard input and output**

```
 -------        ------------        --------
| input |      | processing |      | output |
| file  | ---> | script/prg | ---> | file   |
 -------        ------------        --------
 ```
 command input_file(s) > output_file
 
 command < input_file > output_file
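
A Python script can support both calling forms with the standard `fileinput` module, which reads the files named on the command line, or the standard input if no file name is given. The following is a minimal sketch; the function names and the script name `number_lines.py` mentioned below are hypothetical.

```python
import fileinput
import sys

def number_lines(lines, out=sys.stdout):
    """Write each input line to out, prefixed with its ordinal number."""
    for i, line in enumerate(lines, start=1):
        out.write(f"{i} {line}")

def main(args=None):
    """Process the files given in args (default: command line arguments)."""
    # fileinput.input() falls back to standard input if no file is given
    with fileinput.input(files=args) as lines:
        number_lines(lines)

# When saved as a script (hypothetical name number_lines.py),
# call main() at the end of the file.
```

With `main()` called at the end of the script, both `python number_lines.py input_file > output_file` and `python number_lines.py < input_file > output_file` would produce the same output file.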

**Redirection and pipes**

 ```
 -------        ------------        ------------            --------
| input |      | processing |      | processing |          | output |
| file  | ---> | 1st step   | ---> | 2nd step   | ---> ... | file   |
 -------        ------------        ------------            --------
 ```
 command1 input_file(s) | command2 > output_file

 command1 < input_file | command2 > output_file

In [None]:
from IPython.display import Image
Image(url="https://github.com/OSGeoLabBp/tutorials/blob/master/english/data_processing/lessons/images/file_proc.png?raw=true")

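Example to add an ordinal number to each row:
```
with open('file_name') as fp:
  # enumerate() supplies the counter, starting from 1
  for i, line in enumerate(fp, start=1):
    # line already ends with an EOL marker, so suppress the one added by print
    print(i, line, end='')
```
Try the code above on your machine with a local file. (We can't run this code on Colab, as we have no local data files there.)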

Let's find the bounding box of the coordinates stored in a CSV file. Fields are separated by commas. A few lines from the file:

```
548025.89,5129282.50,1008.79
548026.41,5129284.81,1009.49
548026.81,5129270.56,1005.94
548027.89,5129275.27,1007.15
548029.48,5129282.28,1009.18
548031.57,5129291.52,1011.97
548032.78,5129290.76,1012.10
548031.22,5129283.80,1010.00
```
We will use pandas.

In [None]:
import pandas as pd
names = ['east', 'north', 'elev']
data = pd.read_csv('https://raw.githubusercontent.com/OSGeoLabBp/tutorials/master/english/data_processing/lessons/code/lidar.txt', sep=',', names=names)
mi = data.min()
ma = data.max()
print(mi['east'], ma['east'], mi['north'], ma['north'], mi['elev'], ma['elev'])

548025.89 550424.1 5128996.49 5129293.08 933.31 1139.11
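
For comparison, the same kind of result can be obtained without pandas by reading the records one by one, which keeps only the current minima and maxima in memory. This is a minimal sketch on the few sample lines shown earlier, not on the whole file; the function name `bounding_box` is made up for this example.

```python
import csv
import io

# The sample lines shown earlier; in practice use open('file_name')
sample = """548025.89,5129282.50,1008.79
548026.41,5129284.81,1009.49
548026.81,5129270.56,1005.94
548027.89,5129275.27,1007.15
548029.48,5129282.28,1009.18
548031.57,5129291.52,1011.97
548032.78,5129290.76,1012.10
548031.22,5129283.80,1010.00
"""

def bounding_box(fp):
    """Return (minima, maxima) of the columns, reading row by row."""
    mins = maxs = None
    for row in csv.reader(fp):
        if not row:
            continue                      # skip empty lines
        values = [float(v) for v in row]
        if mins is None:
            # first record initialises both lists
            mins, maxs = values[:], values[:]
        else:
            mins = [min(a, b) for a, b in zip(mins, values)]
            maxs = [max(a, b) for a, b in zip(maxs, values)]
    return mins, maxs

mins, maxs = bounding_box(io.StringIO(sample))
print(mins, maxs)
```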


pandas handles a data set as a table of records; each record has an index.

In [None]:
data.iloc[[0]]

Unnamed: 0,east,north,elev
0,548025.89,5129282.5,1008.79
