<a href="https://colab.research.google.com/github/Jeancjudy/PLP6235C/blob/main/alpha_numeric_unix.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with alphanumeric data in Unix

## **Finding "Alien Genes" in the Plant Pathogen Streptomyces scabies**

*Streptomyces scabies* is a plant pathogen that causes necrosis in potatoes. Most of its virulence factors are located in genomic regions with low GC content, and its virulence genes are highly expressed during the interaction with plant roots. The file "aliens_in_scabies" contains a tab-separated table of more than 1000 genes with three columns: the gene name (first column), the change in expression level between growth in rich medium and interaction with roots (second column), and the GC content of the gene sequence (third column).

This analysis will focus on using Unix commands to:
1. Sort the table by gene expression levels.
2. Create a new file with the top 10 highest expressed genes.
3. Sort the top 10 genes by their GC content.
4. Generate a new table with only the gene names, replacing the "scab" prefix with "SCABIES".
5. Perform statistical analysis of the GC content and expression levels (min, max, average, and median).


### Task 1: Sort the table by expression levels (second column)

In [2]:
from google.colab import files

# Upload the 'aliens_in_scabies' file
uploaded = files.upload()

for fn in uploaded.keys():
  print(f'User uploaded file "{fn}" with length {len(uploaded[fn])} bytes')

Saving aliens_in_scabies to aliens_in_scabies
User uploaded file "aliens_in_scabies" with length 15204 bytes


In [3]:
# Sorting by the second column (expression levels) in descending order
!sort -k2,2nr aliens_in_scabies > sorted_by_expression.txt

# Viewing the sorted file
!cat sorted_by_expression.txt

scab164	50	65
scab397	50	65
scab417	50	69
scab446	50	79
scab72	50	73
scab231	49	67
scab263	49	70
scab318	49	73
scab566	49	55
scab586	49	44
scab605	49	42
scab607	49	43
scab6	49	64
scab736	49	54
scab759	49	53
scab9	49	61
scab119	48	73
scab567	48	74
scab593	48	44
scab865	48	61
scab871	48	62
scab890	48	79
scab991	48	64
scab15	47	64
scab166	47	63
scab191	47	67
scab301	47	69
scab38	47	70
scab424	47	72
scab808	47	66
scab969	47	70
scab447	46	64
scab506	46	76
scab639	46	49
scab68	46	71
scab952	46	69
scab959	46	61
scab139	45	45
scab197	45	42
scab260	45	78
scab290	45	69
scab342	45	71
scab353	45	74
scab365	45	80
scab453	45	70
scab478	45	65
scab565	45	78
scab647	45	46
scab810	45	41
scab845	45	53
scab86	45	42
scab965	45	79
scab97	45	65
scab102	44	75
scab226	44	77
scab36	44	62
scab492	44	75
scab600	44	58
scab60	44	63
scab751	44	66
scab835	44	51
scab951	44	64
scab13	43	70
scab160	43	72
scab270	43	79
scab608	43	58
scab712	43	69
scab786	
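
Genes that share an expression value are not ordered by any biological criterion: when all keys compare equal, `sort` falls back to comparing whole lines, which is why `scab72` follows `scab446` above. A minimal sketch of making the tie-break explicit with a second key is shown below. The three-row table is made up for illustration and is not taken from `aliens_in_scabies`; in Colab the block could be run in a `%%bash` cell.

```shell
# Hypothetical three-gene sample (tab-separated), not taken from the real file
sample=$(mktemp)
printf 'scabA\t50\t65\nscabB\t49\t70\nscabC\t50\t79\n' > "$sample"

# Primary key: column 2, numeric, descending.
# Ties are broken by column 3, numeric, descending.
sorted=$(sort -t "$(printf '\t')" -k2,2nr -k3,3nr "$sample")
printf '%s\n' "$sorted"

rm -f "$sample"
```

With this sample, `scabC` and `scabA` tie on expression, and `scabC` is listed first because of its higher GC content.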

### Task 2: Extract the top 10 highest expressed genes

In [4]:
# Extracting the top 10 highest expressed genes
!head -n 10 sorted_by_expression.txt > top_10_genes.txt

# Viewing the top 10 highest expressed genes
!cat top_10_genes.txt


scab164	50	65
scab397	50	65
scab417	50	69
scab446	50	79
scab72	50	73
scab231	49	67
scab263	49	70
scab318	49	73
scab566	49	55
scab586	49	44
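
Note that the cut-off falls inside a group of ties: in the sorted output of Task 1, eleven genes share the value 49, and `head -n 10` keeps only the first five of them. If tied genes should be kept together, one possible approach is sketched below. The sample table and the variable names `n` and `top_with_ties` are assumptions made for illustration.

```shell
# Hypothetical sample, already sorted by column 2 in descending order
sample=$(mktemp)
printf 'g1\t50\t65\ng2\t49\t70\ng3\t49\t55\ng4\t48\t61\n' > "$sample"

n=2  # cut-off rank (10 in the notebook; 2 here to keep the sample small)

# Print the first n rows, remember the value at rank n,
# then keep printing while later rows tie with that value.
top_with_ties=$(awk -v n="$n" '
  NR <= n      { print; cutoff = $2; next }
  $2 == cutoff { print; next }
               { exit }' "$sample")
printf '%s\n' "$top_with_ties"

rm -f "$sample"
```

Here the result has three rows instead of two, because `g3` ties with `g2` at the cut-off.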


### Task 3: Sort the top 10 genes by GC content (third column)


In [5]:
# Sorting the top 10 genes by the third column (GC content) in descending order
!sort -k3,3nr top_10_genes.txt > top_10_sorted_by_gc.txt

# Viewing the top 10 genes sorted by GC content
!cat top_10_sorted_by_gc.txt

scab446	50	79
scab318	49	73
scab72	50	73
scab263	49	70
scab417	50	69
scab231	49	67
scab164	50	65
scab397	50	65
scab566	49	55
scab586	49	44
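
Because the introduction associates virulence factors with low GC content, it may be more convenient to list the lowest GC values first. A minimal sketch, assuming a made-up three-row table: dropping the `r` modifier from the key gives an ascending numeric sort.

```shell
# Hypothetical sample (tab-separated), not taken from the real file
sample=$(mktemp)
printf 'scabA\t50\t79\nscabB\t49\t44\nscabC\t49\t55\n' > "$sample"

# -k3,3n sorts on column 3 numerically in ascending order (no "r" modifier)
lowest_gc_first=$(sort -k3,3n "$sample")
printf '%s\n' "$lowest_gc_first"

rm -f "$sample"
```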


### Task 4: Replace 'scab' with 'SCABIES' in gene names


In [6]:
# Replacing "scab" with "SCABIES" in gene names
!cut -f1 aliens_in_scabies | sed 's/scab/SCABIES/g' > genes_replaced.txt

# Viewing the gene names with "SCABIES"
!cat genes_replaced.txt

SCABIES1
SCABIES2
SCABIES3
SCABIES4
SCABIES5
SCABIES6
SCABIES7
SCABIES8
SCABIES9
SCABIES10
SCABIES11
SCABIES12
SCABIES13
SCABIES14
SCABIES15
SCABIES16
SCABIES17
SCABIES18
SCABIES19
SCABIES20
SCABIES21
SCABIES22
SCABIES23
SCABIES24
SCABIES25
SCABIES26
SCABIES27
SCABIES28
SCABIES29
SCABIES30
SCABIES31
SCABIES32
SCABIES33
SCABIES34
SCABIES35
SCABIES36
SCABIES37
SCABIES38
SCABIES39
SCABIES40
SCABIES41
SCABIES42
SCABIES43
SCABIES44
SCABIES45
SCABIES46
SCABIES47
SCABIES48
SCABIES49
SCABIES50
SCABIES51
SCABIES52
SCABIES53
SCABIES54
SCABIES55
SCABIES56
SCABIES57
SCABIES58
SCABIES59
SCABIES60
SCABIES61
SCABIES62
SCABIES63
SCABIES64
SCABIES65
SCABIES66
SCABIES67
SCABIES68
SCABIES69
SCABIES70
SCABIES71
SCABIES72
SCABIES73
SCABIES74
SCABIES75
SCABIES76
SCABIES77
SCABIES78
SCABIES79
SCABIES80
SCABIES81
SCABIES82
SCABIES83
SCABIES84
SCABIES85
SCABIES86
SCABIES87
SCABIES88
SCABIES89
SCABIES90
SCABIES91
SCABIES92
SCABIES93
SCABIES94
SCABIES95
SCABIES96
SCABIES97
SCABIES98
SCABIES99
SCABIES100
SCABIES1
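
The `g` flag in the `sed` command replaces every occurrence of `scab` on a line. Since each gene name carries the prefix exactly once, anchoring the pattern with `^` gives the same result here and guards against accidental matches elsewhere in a name. The sketch below uses two rows copied from the outputs in this notebook as a small sample; the variable names `renamed` and `not_renamed` are illustrative.

```shell
# Small sample: two rows as they appear in the notebook outputs
sample=$(mktemp)
printf 'scab1\t33\t70\nscab2\t-13\t67\n' > "$sample"

# Keep only column 1, then rewrite the prefix at the start of each name
renamed=$(cut -f1 "$sample" | sed 's/^scab/SCABIES/')
printf '%s\n' "$renamed"

# Count names that do not start with the new prefix (0 means all were rewritten).
# grep exits non-zero when it selects no lines, hence the "|| true".
not_renamed=$(printf '%s\n' "$renamed" | grep -vc '^SCABIES' || true)
echo "Names not rewritten: $not_renamed"

rm -f "$sample"
```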

### Task 5: Calculate basic statistics for expression levels and GC content

Expression Levels Statistics

In [7]:
# Calculating min, max, and average of expression levels
!awk '{sum+=$2; count+=1; if(min==""){min=max=$2}; if($2>max){max=$2}; if($2<min){min=$2}} END {print "Min expression:", min; print "Max expression:", max; print "Average expression:", sum/count}' aliens_in_scabies


Min expression: -50
Max expression: 50
Average expression: -0.886


GC Content Statistics


In [8]:
# Calculating min, max, and average of GC content
!awk '{sum+=$3; count+=1; if(min==""){min=max=$3}; if($3>max){max=$3}; if($3<min){min=$3}} END {print "Min GC content:", min; print "Max GC content:", max; print "Average GC content:", sum/count}' aliens_in_scabies


Min GC content: 40
Max GC content: 80
Average GC content: 65.529
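
The task list also asks for the median, which the two `awk` commands above do not compute. A minimal sketch of one way to obtain it: sort the column numerically, then take the middle value (or the mean of the two middle values for an even count). The helper `median_of_column` and the four-row sample are hypothetical and only serve as illustration; with the real data, the file name `aliens_in_scabies` would replace `"$sample"`.

```shell
# Hypothetical four-gene sample (tab-separated), not taken from the real file
sample=$(mktemp)
printf 'g1\t-50\t40\ng2\t10\t65\ng3\t50\t80\ng4\t20\t70\n' > "$sample"

# median_of_column FILE COLUMN (hypothetical helper)
# Extracts one column, sorts it numerically, and prints the median.
median_of_column() {
  cut -f"$2" "$1" | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

median_expr=$(median_of_column "$sample" 2)
median_gc=$(median_of_column "$sample" 3)
echo "Median expression: $median_expr"
echo "Median GC content: $median_gc"

rm -f "$sample"
```

For this sample the sorted expression values are -50, 10, 20, 50 and the sorted GC values are 40, 65, 70, 80, so the medians are 15 and 67.5.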


**Slightly more complex statistics**

Range-Based Grouping (Numerical Tasks)

1. Group genes by expression level or GC content ranges: create bins based on ranges of either value (e.g., genes with GC content between 50% and 60%, or expression levels between -10 and 10).
2. Find extreme values: identify genes with extreme GC content or expression levels (e.g., the top 1%, the lowest 1%, or a specific percentile).
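
The binning idea can be sketched with `awk` by rounding each GC value down to the nearest multiple of 10 and counting how many genes fall in each bin. The five-row sample and the bin width of 10 are assumptions chosen for illustration. The rounding relies on `int()`, which truncates toward zero, so this sketch is meant for non-negative values such as GC content.

```shell
# Hypothetical five-gene sample (tab-separated), not taken from the real file
sample=$(mktemp)
printf 'g1\t5\t41\ng2\t-3\t49\ng3\t12\t65\ng4\t30\t69\ng5\t-20\t80\n' > "$sample"

# Assign each gene to a 10-point GC bin (40-49, 50-59, ...) and count per bin.
# awk does not guarantee the order of "for (b in n)", so sort the result.
bins=$(awk '
  { b = int($3 / 10) * 10; n[b]++ }
  END { for (b in n) printf "%d-%d\t%d\n", b, b + 9, n[b] }' "$sample" | sort -n)
printf '%s\n' "$bins"

rm -f "$sample"
```

Bins without any gene (50-59 and 70-79 in this sample) are simply absent from the output.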

In [9]:
# Find genes with GC content between 60% and 70%
!awk '$3 >= 60 && $3 <= 70' aliens_in_scabies

scab2	-13	67
scab3	11	60
scab6	49	64
scab7	-42	63
scab9	49	61
scab10	6	65
scab11	-13	65
scab14	-16	65
scab15	47	64
scab17	-36	66
scab18	29	68
scab22	8	69
scab23	-43	69
scab25	9	66
scab29	-41	60
scab31	-30	63
scab32	8	64
scab35	31	66
scab36	44	62
scab40	-34	60
scab44	13	68
scab46	-24	67
scab47	-11	60
scab51	-36	66
scab55	28	62
scab58	34	68
scab60	44	63
scab62	-38	60
scab64	-11	62
scab66	16	60
scab69	24	69
scab71	33	61
scab76	1	60
scab77	13	68
scab79	-29	60
scab83	-44	60
scab84	-3	61
scab92	-49	67
scab96	41	62
scab97	45	65
scab108	36	69
scab109	-27	64
scab122	16	64
scab124	21	63
scab125	33	69
scab126	-10	60
scab127	-29	69
scab133	39	66
scab134	42	68
scab136	12	69
scab147	-44	64
scab148	40	60
scab150	5	69
scab151	5	65
scab159	-40	65
scab161	23	63
scab163	-29	67
scab164	50	65
scab165	29	65
scab166	47	63
scab168	39	62
scab170	-30	65
scab171	-3	61
scab172	14	66
scab175	-1	64
scab176	-37	62
scab181	-16	64
scab187	-8	65
scab19

In [10]:
# Find genes with expression levels between -10 and 10
!awk '$2 >= -10 && $2 <= 10' aliens_in_scabies

scab4	-8	71
scab10	6	65
scab19	-5	78
scab22	8	69
scab25	9	66
scab30	3	74
scab32	8	64
scab34	-3	73
scab39	-10	77
scab43	-10	70
scab48	4	75
scab52	9	79
scab57	-8	79
scab65	-2	73
scab70	-9	70
scab75	5	74
scab76	1	60
scab84	-3	61
scab85	-5	41
scab89	-9	48
scab93	-9	56
scab103	1	49
scab104	-3	49
scab111	0	73
scab113	1	56
scab126	-10	60
scab128	6	72
scab129	-3	76
scab138	9	59
scab140	4	73
scab145	9	49
scab149	1	76
scab150	5	69
scab151	5	65
scab156	-6	43
scab171	-3	61
scab175	-1	64
scab187	-8	65
scab190	2	66
scab194	-3	67
scab199	-2	73
scab203	9	47
scab204	8	72
scab208	-6	64
scab210	-10	71
scab211	-1	45
scab221	-3	66
scab224	-8	69
scab227	-10	80
scab229	-2	68
scab234	5	69
scab244	-4	74
scab250	-5	63
scab251	-10	71
scab253	10	63
scab255	2	60
scab265	-10	61
scab271	-9	69
scab272	10	73
scab273	-8	60
scab277	9	70
scab278	2	74
scab285	10	69
scab291	7	78
scab292	-5	77
scab303	-9	69
scab317	5	62
scab320	0	62
scab327	1	74
scab332	-9
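
The extreme values mentioned earlier (for example the top 1% or the lowest 1%) can be selected by converting the percentage into a row count and combining `sort` with `head`. This is a sketch under stated assumptions: the sample table is made up, and the variable names `pct`, `keep`, `top`, and `bottom` are illustrative.

```shell
# Hypothetical five-gene sample (tab-separated), not taken from the real file
sample=$(mktemp)
printf 'g1\t5\t41\ng2\t-3\t49\ng3\t12\t65\ng4\t30\t69\ng5\t-20\t80\n' > "$sample"

pct=40  # percentage of rows to keep (1 for the top 1%; 40 here for a small sample)
total=$(wc -l < "$sample")

# Integer ceiling of total * pct / 100
keep=$(( (total * pct + 99) / 100 ))

# Highest and lowest expression values (column 2)
top=$(sort -k2,2nr "$sample" | head -n "$keep")
bottom=$(sort -k2,2n "$sample" | head -n "$keep")

echo "Top ${pct}%:";    printf '%s\n' "$top"
echo "Bottom ${pct}%:"; printf '%s\n' "$bottom"

rm -f "$sample"
```

With five rows and `pct=40`, two rows are kept at each end.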

In [11]:
# Find genes whose name starts with "scab2" and that have a positive expression level
!awk '$1 ~ /^scab2/ && $2 > 0' aliens_in_scabies

scab20	20	79
scab21	15	75
scab22	8	69
scab25	9	66
scab28	16	79
scab203	9	47
scab204	8	72
scab206	27	53
scab207	29	61
scab214	15	77
scab216	36	59
scab220	31	77
scab225	31	79
scab226	44	77
scab228	42	77
scab230	19	75
scab231	49	67
scab233	24	69
scab234	5	69
scab241	17	71
scab243	19	71
scab247	40	75
scab248	33	69
scab249	12	64
scab252	16	67
scab253	10	63
scab255	2	60
scab256	34	65
scab257	16	71
scab260	45	78
scab262	19	68
scab263	49	70
scab264	39	78
scab266	27	63
scab267	27	78
scab270	43	79
scab272	10	73
scab276	39	63
scab277	9	70
scab278	2	74
scab280	20	75
scab281	12	67
scab283	24	71
scab285	10	69
scab288	25	71
scab289	35	77
scab290	45	69
scab291	7	78
scab294	42	77
scab295	37	73
scab296	31	66
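
The pattern above works as a prefix match, so it selects scab2, scab20 to scab29, scab200 to scab299, and so on. If only a specific numeric range is wanted, the pattern can be anchored at both ends. A minimal sketch, using four rows copied from the outputs in this notebook as a sample; the variable names `prefix` and `twenties` are illustrative.

```shell
# Small sample: four rows as they appear in the notebook outputs
sample=$(mktemp)
printf 'scab2\t-13\t67\nscab20\t20\t79\nscab25\t9\t66\nscab203\t9\t47\n' > "$sample"

# Prefix match: any name beginning with "scab2"
prefix=$(awk '$1 ~ /^scab2/ && $2 > 0 { print $1 }' "$sample")

# Exact range: "scab2" followed by exactly one digit (scab20 to scab29)
twenties=$(awk '$1 ~ /^scab2[0-9]$/ && $2 > 0 { print $1 }' "$sample")

echo "Prefix match:";  printf '%s\n' "$prefix"
echo "scab20-scab29:"; printf '%s\n' "$twenties"

rm -f "$sample"
```

In both cases `scab2` itself is excluded because its expression level is negative.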


In [12]:
# List genes with both high expression (above 10) and high GC content (70% or more)
!awk '$2 > 10 && $3 >= 70' aliens_in_scabies

scab1	33	70
scab5	31	77
scab13	43	70
scab16	17	80
scab20	20	79
scab21	15	75
scab28	16	79
scab38	47	70
scab41	36	71
scab42	16	80
scab45	28	78
scab49	29	79
scab50	38	75
scab54	15	71
scab67	11	76
scab68	46	71
scab72	50	73
scab91	17	80
scab95	22	71
scab102	44	75
scab107	29	80
scab119	48	73
scab135	39	77
scab152	15	75
scab157	24	75
scab160	43	72
scab178	42	73
scab179	40	79
scab185	26	73
scab189	32	77
scab198	22	80
scab214	15	77
scab220	31	77
scab225	31	79
scab226	44	77
scab228	42	77
scab230	19	75
scab241	17	71
scab243	19	71
scab247	40	75
scab257	16	71
scab260	45	78
scab263	49	70
scab264	39	78
scab267	27	78
scab270	43	79
scab280	20	75
scab283	24	71
scab288	25	71
scab289	35	77
scab294	42	77
scab295	37	73
scab305	19	75
scab306	16	71
scab307	11	72
scab311	34	70
scab314	24	80
scab316	29	75
scab318	49	73
scab325	16	72
scab326	42	79
scab328	42	78
scab333	20	76
scab335	11	80
scab336	12	76
scab342	45	71
scab353	45	74
scab358	33	71
s
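
A threshold filter lists the genes that satisfy both conditions, but it does not measure how strongly the two columns are associated. As an illustrative alternative, which is not part of the original analysis, a Pearson correlation coefficient between expression and GC content can be computed in a single `awk` pass. The three-row sample is made up and perfectly linear, so the result is 1.000.

```shell
# Hypothetical three-gene sample (tab-separated), not taken from the real file
sample=$(mktemp)
printf 'g1\t10\t50\ng2\t20\t60\ng3\t30\t70\n' > "$sample"

# Pearson correlation between column 2 (expression) and column 3 (GC content)
r=$(awk '
  { n++; sx += $2; sy += $3; sxy += $2 * $3; sxx += $2 * $2; syy += $3 * $3 }
  END {
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    if (den == 0) { print "undefined"; exit }
    printf "%.3f\n", num / den
  }' "$sample")
echo "Pearson r: $r"

rm -f "$sample"
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.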